My very first Machine Learning Project - California Housing Price Prediction - 1. Data Extraction.

Aaditya Bansal
Feb 8, 2023

Updated: Feb 18, 2023

California Housing Price Prediction

Welcome to my very first Machine Learning / Data Science project. I will be sharing the process and updates through a series of blog posts. In this post I give an overview of the project and perform the very first step of a Machine Learning / Data Science project: Data Extraction!

You can also view this project on Google Colab.

Overview

This project notebook covers all the steps needed to complete the Machine Learning task of predicting housing prices on the California Housing dataset available in scikit-learn. We will perform the following steps to build a model for house price prediction:

1. Data Extraction

Import libraries.
Import Dataset from scikit-learn.
Understanding the given Description of Data and the problem Statement
Take a look at different Inputs and details available with dataset.
Storing the obtained dataset into a Pandas Data Frame.

2. EDA (Exploratory Data Analysis) and Visualization

Getting a closer Look at obtained Data.
Exploring different Statistics of the Data (Summary and Distributions)
Looking at Correlations (between individual features and between Input features and Target)
Geospatial Data / Coordinates - Longitude and Latitude features

3. Preprocessing

Dealing with Duplicate and Null (NaN) values
Dealing with Categorical features (e.g. Dummy coding)
Dealing with Outlier values
Visualization (Box-Plots)
Using IQR
Using Z-Score
Separating Target and Input Features
Target feature Normalization (Plots and Tests)
Splitting Dataset into train and test sets
Feature Scaling (Feature Transformation)
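
To make the two outlier rules named above concrete, here is a minimal sketch of how IQR-based and Z-score-based flags could be computed with NumPy. The function names, the multiplier k = 1.5, the thresholds and the sample values are assumptions chosen for illustration; they are not taken from the project notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outlier_mask(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    values = np.asarray(values, dtype=float)
    z_scores = (values - values.mean()) / values.std()
    return np.abs(z_scores) > threshold

# Made-up sample with one obviously extreme value (50.0).
sample = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])

iqr_flags = iqr_outlier_mask(sample)
z_flags_strict = zscore_outlier_mask(sample)               # threshold = 3.0
z_flags_loose = zscore_outlier_mask(sample, threshold=2.5)

print(iqr_flags)
print(z_flags_strict)
print(z_flags_loose)
```

On a sample this small the Z-score of the extreme value stays just below 3, so the threshold may need adjusting, while the IQR rule flags it with the usual multiplier.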

4. Modeling

Specifying Evaluation Metric R squared (using Cross-Validation)
Model Training - trying multiple models and hyperparameters:
Linear Regression
Polynomial Regression
Ridge Regression
Decision Trees Regressor
Random Forests Regressor
Gradient Boosted Regressor
eXtreme Gradient Boosting (XGBoost) Regressor
Support Vector Regressor
Model Selection (by comparing evaluation metrics)
Learn Feature Importance and Relations
Prediction
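
As a rough preview of the evaluation step, the sketch below scores a linear regression with 5-fold cross-validated R squared using scikit-learn's cross_val_score. It runs on synthetic data generated with NumPy, not on the housing data, and the number of folds, the coefficients and the noise level are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumption: stands in for real features and target).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One R squared value per fold; the mean is a common single summary.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()

print(scores)
print(mean_r2)
```

The same call works with any scikit-learn regressor in place of LinearRegression, which is what makes it convenient for comparing several models on one metric.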

5. Deployment

Exporting the trained model to be used for later predictions. (by storing model object as byte file - Pickling)
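
A minimal sketch of what pickling a model could look like, assuming a tiny LinearRegression fitted on made-up numbers as a stand-in for the final model. The file name model.pkl and the temporary directory are hypothetical choices for this example.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up points that lie on y = 2x + 1.
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # Store the model object as a byte file.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later prediction script would.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

prediction = restored.predict([[3.0]])[0]
print(prediction)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.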

1. Data Extraction

Import the libraries needed for extracting and representing (visualizing) the data.

    # IMPORTING LIBRARIES
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Jupyter / Colab magic: render plots inside the notebook
    %matplotlib inline

Import / load the California Housing data from scikit-learn (sklearn). Passing as_frame=True returns the data as pandas objects instead of NumPy arrays.

    # IMPORTING DATA
    from sklearn.datasets import fetch_california_housing
    cal_housing_dataset = fetch_california_housing(as_frame=True)

Understanding the given description of the data and the problem statement. Loading a dataset through sklearn returns a Bunch object, which is similar to a dictionary: it holds information about the dataset alongside the actual data. We can list the available keys with the keys() method.

    # LIST OF KEYS AVAILABLE WITH DATASET BUNCH OBJECT
    cal_housing_dataset.keys()

Output: dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

The Bunch obtained from sklearn provides the following keys:

data - It contains data rows, each row corresponding to the 8 input feature values.
target - It contains target data rows; each value corresponds to the median house value of a block group, in units of 100,000 US Dollars.
frame - Only present when as_frame = True. Pandas Data Frame with data and target.
target_names - Name of the target feature.
feature_names - Array of ordered feature names used in the dataset.
DESCR - Description of the California housing dataset. This is important to understand the meaning of features that will be used to predict the housing prices.
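
The dictionary-like behaviour of a Bunch can be tried without downloading anything. The sketch below builds a small Bunch by hand with sklearn.utils.Bunch; the keys mirror three of those listed above, but the values are made-up placeholders rather than housing data.

```python
from sklearn.utils import Bunch

# Hand-built Bunch with placeholder values (not the real dataset).
toy = Bunch(
    data=[[1.0, 2.0]],
    target=[0.5],
    feature_names=["feature_a", "feature_b"],
)

keys = list(toy.keys())                 # dictionary-style listing of keys
same_object = toy["data"] is toy.data   # key access and attribute access agree

print(keys)
print(same_object)
```

This is why both cal_housing_dataset["DESCR"] and cal_housing_dataset.DESCR refer to the same description text.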

Take a look at the different inputs and details available with the dataset. The DESCR key holds a description of the data, including the shape of the dataset and the features available for predicting house prices.

    # DESCRIPTION OF DATASET
    print(cal_housing_dataset.DESCR)

Output:

.. _california_housing_dataset:

California Housing dataset

--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:

- MedInc        median income in block group
- HouseAge      median house age in block group
- AveRooms      average number of rooms per household
- AveBedrms     average number of bedrooms per household
- Population    block group population
- AveOccup      average number of household members
- Latitude      block group latitude
- Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

An household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the :func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297

From the dataset description above we can see that we have 20640 housing data points (records). Each record describes one census block group, not a single house, in the form of 8 input features:

MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude and Longitude

This information about the block in which a house is located can be used to build a model that predicts the price of a new house with a different set of characteristics. We can also separately list the input features and the target feature.

    # PREDICTIVE (INPUT) FEATURES AVAILABLE
    print(cal_housing_dataset.feature_names)

Output: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

    # TARGET (OUTPUT) FEATURE
    print(cal_housing_dataset.target_names)

Output: ['MedHouseVal']

The data key of the Bunch object contains the input feature values for the housing records.

    # DATA AVAILABLE CORRESPONDING TO INPUT FEATURES FOR EACH HOUSING RECORD
    print(cal_housing_dataset.data)

Output:

We also have the corresponding housing price for each of the records.

    # RESULTING TARGET / OUTPUT FEATURE VALUES AVAILABLE
    print(cal_housing_dataset.target)

Output:
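
Because the target is expressed in units of 100,000 US dollars, turning a value back into dollars is a single multiplication. A minimal sketch, where the helper name to_dollars and the two sample values are made up for illustration and are not rows from the dataset:

```python
import pandas as pd

DOLLARS_PER_UNIT = 100_000  # the target is in units of 100,000 US dollars

def to_dollars(target):
    """Convert target values from units of $100,000 to plain US dollars."""
    return target * DOLLARS_PER_UNIT

# Placeholder values for illustration only (not taken from the dataset).
toy_target = pd.Series([1.0, 2.5], name="MedHouseVal")
dollars = to_dollars(toy_target)

print(dollars.tolist())
```

Applied to the real data, the same helper could be called as to_dollars(cal_housing_dataset.target).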

We can see the complete dataset (input features and target) using the frame key of our dataset object. It stores all records and features as a single Pandas DataFrame, and it is available because we loaded the data with as_frame=True.

    # THE COMPLETE DATASET AVAILABLE FOR USE
    cal_housing_dataset.frame

Output:
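
One way to picture the frame is as the input features and the target placed side by side. The sketch below is not how scikit-learn builds it internally; it only illustrates that relationship with pd.concat on a tiny made-up table. The column names come from the dataset, the values are placeholders.

```python
import pandas as pd

# Placeholder input features and target (values are made up).
data = pd.DataFrame({"MedInc": [1.0, 2.0], "HouseAge": [10.0, 20.0]})
target = pd.Series([1.0, 2.5], name="MedHouseVal")

# Put the target next to the features, column-wise.
frame = pd.concat([data, target], axis=1)

print(frame.columns.tolist())
print(frame.shape)
```

The target ends up as the last column, after all the input features.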

We can store cal_housing_dataset.frame in a separate pandas variable (as a DataFrame) for easy reference later on.

    # STORING THE HOUSING DATA IN A SEPARATE PANDAS DATAFRAME VARIABLE
    dataset = cal_housing_dataset.frame

We can view the top 5 rows of the dataset using the .head() method of the DataFrame.

    # VIEW TOP 5 ROWS OF DATASET
    dataset.head()

Output: