California Housing Price Prediction

Welcome to my very first Machine Learning / Data Science project. I will be sharing the process and updates through blog posts. In this post I give an overview of the project and perform the very first step of a Machine Learning / Data Science project: Data Extraction!
You can also view this project on Google Colab.
Overview
This project notebook covers all the steps needed to complete the Machine Learning task of predicting housing prices on the California Housing dataset available in scikit-learn. Step 1, Data Extraction, is performed in this post; the remaining steps planned for creating a house price prediction model are:
2. EDA (Exploratory Data Analysis) and Visualization
Getting a closer Look at obtained Data.
Exploring different Statistics of the Data (Summary and Distributions)
Looking at Correlations (between individual features and between Input features and Target)
Geospatial Data / Coordinates - Longitude and Latitude features
3. Preprocessing
Dealing with Duplicate and Null (NaN) values
Dealing with Categorical features (e.g. Dummy coding)
Dealing with Outlier values
Visualization (Box-Plots)
Using IQR
Using Z-Score
Separating Target and Input Features
Target feature Normalization (Plots and Tests)
Splitting Dataset into train and test sets
Feature Scaling (Feature Transformation)
4. Modeling
Specifying Evaluation Metric R squared (using Cross-Validation)
Model Training - trying multiple models and hyperparameters:
Linear Regression
Polynomial Regression
Ridge Regression
Random Forests Regressor
Gradient Boosted Regressor
eXtreme Gradient Boosting (XGBoost) Regressor
Support Vector Regressor
Model Selection (by comparing evaluation metrics)
Learn Feature Importance and Relations
Prediction
5. Deployment
Exporting the trained model to be used for later predictions (by storing the model object as a byte file, i.e. pickling).
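The pickling idea from the Deployment step can be sketched with Python's built-in pickle module. This is only a minimal illustration: the tiny training set below is made up, and the model is a plain LinearRegression chosen for brevity, not the model this project will finally select.

```python
import pickle

from sklearn.linear_model import LinearRegression

# Tiny made-up training set (y = 2 * x), only to have a fitted model to save.
X_train = [[1.0], [2.0], [3.0]]
y_train = [2.0, 4.0, 6.0]
model = LinearRegression().fit(X_train, y_train)

# Serialize the fitted model to bytes. pickle.dump(model, f) would write
# the same bytes to a file opened with open(path, "wb").
model_bytes = pickle.dumps(model)

# Later, restore the model and use it for predictions.
restored_model = pickle.loads(model_bytes)
prediction = restored_model.predict([[4.0]])
print(prediction)
```

A pickled model should only be loaded from a trusted source, and ideally with the same scikit-learn version that created it.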
1. Data Extraction
Importing all the libraries needed for extracting and representing (visualizing) data.
#IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Importing / loading the California Housing price data from scikit-learn (sklearn). Passing the parameter as_frame = True returns the data in the form of a Pandas DataFrame.
#IMPORTING DATA
from sklearn.datasets import fetch_california_housing
cal_housing_dataset = fetch_california_housing(as_frame = True)
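As an aside, the same loader can also hand back the input features and the target directly, without going through the dataset object. The sketch below uses the loader's return_X_y parameter together with as_frame; the helper name load_features_and_target is hypothetical and only there to keep the download out of import time.

```python
from sklearn.datasets import fetch_california_housing


def load_features_and_target():
    """Return (X, y): the input features as a DataFrame and the target as a Series.

    The first call downloads the data and caches it locally, so it needs
    an internet connection.
    """
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    return X, y
```

Calling X, y = load_features_and_target() would then give the 8 input columns in X and the MedHouseVal column in y.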
Understanding the given Description of Data and the problem Statement.
When we import a dataset using sklearn we obtain a Bunch object, which is similar to a dictionary and contains both information about the dataset and the actual data that we can use.
We can access the available keys in the Bunch object using the keys() function.
#LIST OF KEYS AVAILABLE WITH DATASET BUNCH OBJECT
cal_housing_dataset.keys()
Output:
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
We have the following keys available in the Bunch (data) obtained from sklearn:
data - It contains data rows, each row corresponding to the 8 input feature values.
target - It contains target data rows; each value corresponds to the median house value of a block group in units of 100,000 US Dollars.
frame - Only present when as_frame = True. Pandas Data Frame with data and target.
target_names - Name of the target feature.
feature_names - Array of ordered feature names used in the dataset.
DESCR - Description of the California housing dataset. This is important to understand the meaning of features that will be used to predict the housing prices.
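The dictionary-like behaviour of a Bunch can be tried out on a small hand-built one. The sketch below uses made-up values and only three of the keys listed above, so that nothing needs to be downloaded.

```python
from sklearn.utils import Bunch

# A hand-built Bunch with made-up contents, to show the access patterns
# without downloading anything.
toy_dataset = Bunch(
    data=[[8.3, 41.0], [7.2, 21.0]],
    target=[4.5, 3.5],
    feature_names=["MedInc", "HouseAge"],
)

# Keys can be listed like a dictionary's.
print(list(toy_dataset.keys()))

# The same value is reachable by key or by attribute.
print(toy_dataset["feature_names"])
print(toy_dataset.feature_names)
```

The same two access styles work on cal_housing_dataset, for example cal_housing_dataset["DESCR"] and cal_housing_dataset.DESCR.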
Take a look at the different inputs and details available with the dataset
We can look at the information available in the DESCR key to get an understanding of the data, such as the shape of our dataset and the different features that we can use for predicting house prices.
#DESCRIPTION OF DATASET
print(cal_housing_dataset.DESCR)
Output:
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
An household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the :func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297
Using the dataset description above we can see that we have 20640 housing data points (records), one per census block group, and each record contains information about the houses of that block group in the form of 8 input features:
MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude and Longitude
This information about the block in which a house is located can be used to create a model that predicts what the price of a new house with a different set of characteristics should be.
We can also separately get a list of all the input features and the target feature available.
#PREDICTIVE (INPUT) FEATURES AVAILABLE
print(cal_housing_dataset.feature_names)
Output:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
#TARGET (OUTPUT) FEATURE
print(cal_housing_dataset.target_names)
Output:
['MedHouseVal']
The data key of the Bunch object contains the input feature values for the housing records.
#DATA AVAILABLE CORRESPONDING TO INPUT FEATURES FOR EACH HOUSING RECORD
print(cal_housing_dataset.data)
Output:

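The record and feature counts quoted in the description can be checked directly on the loaded input data. The sketch below defines a hypothetical helper, describe_shape, and runs it on a made-up two-row sample so that it works without the download.

```python
import pandas as pd


def describe_shape(features: pd.DataFrame) -> dict:
    """Return the number of records and the feature names of a DataFrame."""
    n_records, n_features = features.shape
    return {
        "n_records": n_records,
        "n_features": n_features,
        "feature_names": list(features.columns),
    }


# Made-up two-row sample with two of the eight columns, for illustration.
sample = pd.DataFrame({"MedInc": [8.3, 7.2], "HouseAge": [41.0, 21.0]})
summary = describe_shape(sample)
print(summary)
```

Applied to cal_housing_dataset.data, the description implies 20640 records and 8 features.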
We also have the corresponding housing price for each of the records.
#RESULTING TARGET / OUTPUT FEATURE VALUES AVAILABLE
print(cal_housing_dataset.target)
Output:

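Because the target is stored in units of $100,000, it can help to convert it to plain US dollars when reading the values. A minimal sketch, using made-up target values instead of the real column:

```python
import pandas as pd

# Made-up target values in the dataset's units of $100,000.
med_house_val = pd.Series([4.5, 3.5, 0.5], name="MedHouseVal")

# Multiply by 100,000 to express the median house value in plain US dollars.
med_house_val_usd = med_house_val * 100_000
print(med_house_val_usd.tolist())
```

The same multiplication works on cal_housing_dataset.target, since it is also a pandas Series when as_frame = True is used.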
We can see the complete dataset (input features and targets) using the frame key of our dataset object which stores the complete housing dataset (all records and features) as a Pandas DataFrame because we imported our data using the parameter as_frame = True.
#THE COMPLETE DATASET AVAILABLE FOR USE
cal_housing_dataset.frame
Output:

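To make the relation between the data, target and frame keys concrete, the sketch below builds a small stand-in dataset object with made-up values and shows that joining data and target column-wise gives back the frame.

```python
import pandas as pd
from sklearn.utils import Bunch

# Made-up complete frame: two input features followed by the target column.
frame = pd.DataFrame(
    {
        "MedInc": [8.3, 7.2],
        "HouseAge": [41.0, 21.0],
        "MedHouseVal": [4.5, 3.5],
    }
)

# Hand-built stand-in for the real dataset object.
toy_dataset = Bunch(
    frame=frame,
    data=frame[["MedInc", "HouseAge"]],
    target=frame["MedHouseVal"],
)

# Joining the input features and the target side by side rebuilds the frame.
rebuilt = pd.concat([toy_dataset.data, toy_dataset.target], axis=1)
print(rebuilt.equals(toy_dataset.frame))
```

This is why the frame key is convenient: the inputs and the target stay aligned row by row in a single DataFrame.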
We can store our cal_housing_dataset.frame in a separate new variable (as a pandas DataFrame) for easy reference later on.
#STORING THE HOUSING DATA IN A SEPARATE PANDAS DATAFRAME VARIABLE
dataset = cal_housing_dataset.frame
We can view the first 5 rows of the dataset using the .head() method of the DataFrame.
#VIEW TOP 5 ROWS OF DATASET
dataset.head()
Output:

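A first look at a freshly loaded DataFrame usually combines head() with a quick missing-value count. The sketch below uses made-up rows standing in for the housing DataFrame, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the housing DataFrame.
dataset = pd.DataFrame(
    {
        "MedInc": [8.3, 7.2, 5.6, 3.8, 4.0, 3.7],
        "HouseAge": [41.0, 21.0, 52.0, 52.0, 52.0, 52.0],
        "MedHouseVal": [4.5, 3.5, 3.4, 3.4, 2.7, 3.0],
    }
)

# head() returns the first 5 rows by default; pass n for a different count.
first_rows = dataset.head()
first_three = dataset.head(3)

# Count missing values per column; the dataset description reports none.
missing_per_column = dataset.isna().sum()
print(len(first_rows), len(first_three))
print(missing_per_column.to_dict())
```

On the real data, a column with a non-zero count here would contradict the "Missing Attribute Values: None" line in the description and would need attention in the preprocessing step.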
Thank you for your time!! The following is the file with progress of the project until now.