
California Housing Price Prediction Project - 2. EDA and Visualization

Aaditya Bansal

Updated: Feb 18, 2023

Housing Price Prediction

Welcome to my very first Machine Learning / Data Science Project.


This post is a continuation of Part 1 - Data Extraction of this project; please check it out if you haven't already.


I will be sharing the process and updates through these blog posts.


In this Blog Post, I have detailed the Overview and focused on a very crucial part of a Machine Learning / Data Science Project: EDA (Exploratory Data Analysis) and Visualization!!


 

Overview

This Project Notebook covers all the necessary steps to complete the Machine Learning task of predicting housing prices on the California Housing Dataset available in scikit-learn. We will perform the following steps to successfully create a model for house price prediction:

1. Data Extraction (See Details in Previous Blog)

  • Importing libraries.

  • Importing the dataset from scikit-learn.

  • Understanding the given description of the data and the problem statement.

  • Taking a look at the different inputs and details available with the dataset.

  • Storing the obtained dataset in a Pandas DataFrame.

2. EDA (Exploratory Data Analysis) and Visualization (Covered in this Blog)

3. Preprocessing

  • Dealing with Duplicate and Null (NaN) values

  • Dealing with Categorical features (e.g. Dummy coding)

  • Dealing with Outlier values

    • Visualization (Box-Plots)

    • Using IQR

    • Using Z-Score

  • Separating Target and Input Features

  • Target feature Normalization (Plots and Tests)

  • Splitting Dataset into train and test sets

  • Feature Scaling (Feature Transformation)

4. Modeling

  • Specifying the Evaluation Metric - R squared (using Cross-Validation)

  • Model Training - trying multiple models and hyperparameters:

    • Linear Regression

    • Polynomial Regression

    • Ridge Regression

    • Decision Tree Regressor

    • Random Forest Regressor

    • Gradient Boosting Regressor

    • eXtreme Gradient Boosting (XGBoost) Regressor

    • Support Vector Regressor

  • Model Selection (by comparing evaluation metrics)

  • Learn Feature Importance and Relations

  • Prediction

5. Deployment

  • Exporting the trained model to be used for later predictions (by storing the model object as a byte file - Pickling); see the sketch below.
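To illustrate that last step, here is a minimal sketch of pickling and un-pickling a model. The stand-in LinearRegression, the toy data and the file name house_price_model.pkl are assumptions for illustration only, not the project's final model.

import pickle
from sklearn.linear_model import LinearRegression

# Stand-in for the selected best model, fitted on toy data (illustration only)
model = LinearRegression().fit([[1], [2], [3]], [1, 2, 3])

# Store the trained model object as a byte file (pickling)
with open("house_price_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back later to make predictions
with open("house_price_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[4]]))  # ~[4.]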

 

2. EDA (Exploratory Data Analysis) and Visualization


Before working with any kind of data, it is important to understand it. A crucial step to achieve this is Exploratory Data Analysis (EDA): a combination of visualizations and statistical analyses (univariate, bivariate and multivariate) that helps us better understand the data we are working with and gain insight into the relationships within it.

So, let's explore our target variable and how the other features influence it.

 

Taking a closer look at the obtained data. We can see the overall details and information about the obtained dataset and its individual feature columns using the Pandas functions:

  • info() - gives an overview of the type of data available.

  • describe() - gives a statistical summary of the available features.

#USING info() TO GET OVERVIEW OF INFORMATION / STRUCTURE OF FEATURE DATA
dataset.info()

Output:

From the quick summary (overview) of the dataset above, we can see that all of our features are of numerical (real / floating point) type and none of the features have any missing values.

 

Exploring different Statistics of the Data (Summary and Distributions)

Among our features we also have Latitude and Longitude (geospatial features - coordinates), which we will analyze separately later.

So, right now we will focus on the rest of the features.

#LIST OF FEATURES IN DATASET EXCLUDING Latitude AND Longitude
features_for_EDA = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "MedHouseVal"]

#SUMMARIZING STATISTICS OF THE DATASET
dataset[features_for_EDA].describe()


Output:

The above statistical summary of the dataset gives us insights about the distribution of individual features (how feature values vary in the dataset). We can better understand the feature distributions using graphs / plots.

Visualizing the data distribution of the features:

#CREATING A DISTRIBUTION PLOT FOR TARGET FEATURE
sns.displot(dataset['MedHouseVal'], kde = True)
plt.title('MedHouseVal Distribution')
plt.xlabel("Median house value in ($100,000)")
plt.legend(['MedHouseVal Distribution'], loc = 'best')
plt.show()


Output:

We can clearly see that the distribution does not look normal, but highly right-skewed. If we train models like a Linear Regressor on such skewed data and then make predictions, they will not be able to capture the correct relations in the data to make accurate predictions.


Despite that, let's leave it as it is for now; we'll deal with it later in the notebook.


We can also take a brief look at the distribution of Input features available using plots.


#VISUALIZING THE DISTRIBUTION IN SELECTED FEATURES EXCLUDING MedHouseVal
for feature in features_for_EDA[:-1]:
    sns.displot(dataset[feature], kde = True)


Output:
As we can see from the plots above, most of the input features (except HouseAge) also contain highly skewed data, which can affect our model's performance: it might focus too much on the outlier values and fail to learn the general relations and patterns in the data.


So, we can try to deal with them and make our data distributions closer to a normal distribution, which in some cases helps the model make more accurate predictions (and also generalize better). One common option is a log transform, sketched below.
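For instance, here is a minimal sketch of one common option (my illustration, not necessarily the transformation applied later in the notebook): applying NumPy's log1p transform to compress the long right tail of a skewed feature such as MedInc.

import numpy as np

# log1p (log(1 + x)) compresses the long right tail of a right-skewed feature
medinc_log = np.log1p(dataset["MedInc"])

print("Skewness before:", dataset["MedInc"].skew())
print("Skewness after: ", medinc_log.skew())

# The transformed distribution should look much closer to normal
sns.displot(medinc_log, kde = True)
plt.title("MedInc after log1p transform")
plt.show()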

 

Looking at Correlations (between individual features, and between the Input features and the Target)

The correlation matrix is the best way to see all the numerical correlations between features at once. Let's see which features correlate most with our target variable.

We can also check the correlations between the input features (if two features are highly correlated, we can keep just one of them to convey the needed information and remove the other - feature selection). We can use the corr() function of a Pandas DataFrame, which returns the Pearson correlation between columns (features) by default.


#corr() - CORRELATION BETWEEN COLUMNS (FEATURES) - DEFAULT = PEARSON'S COEFFICIENT
#CORRELATION MATRIX
dataset[features_for_EDA].corr()


Output:

Above we get the correlation matrix between features, which reports the correlation using Pearson's coefficient.
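As a side note, here is a small sketch (my own illustration, not from the original notebook) of how one could programmatically flag feature pairs whose absolute Pearson correlation exceeds a threshold, to support the feature-selection idea mentioned above; the 0.7 cutoff is an arbitrary assumption.

#FLAG HIGHLY CORRELATED FEATURE PAIRS (HYPOTHETICAL THRESHOLD = 0.7)
corr_matrix = dataset[features_for_EDA].corr().abs()

# Scan only the upper triangle so each pair is reported once
for i, f1 in enumerate(features_for_EDA):
    for f2 in features_for_EDA[i + 1:]:
        if corr_matrix.loc[f1, f2] > 0.7:
            print(f"{f1} and {f2}: |correlation| = {corr_matrix.loc[f1, f2]:.2f}")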


We can understand this better using a heatmap visualization.


#HEATMAP TO VISUALIZE CORRELATION MATRIX
sns.heatmap(dataset[features_for_EDA].corr(), annot = True)


Output:

Now that we know which feature (MedInc) correlates most with our target variable, we can investigate it more in depth.


#MedInc - MedHouseVal [Pearson = 0.69]
plt.figure(figsize=(12, 8))
plt.scatter(data = dataset, x = "MedInc", y = "MedHouseVal")
plt.xlabel("MedInc")
plt.ylabel("MedHouseVal")


Output:


(There exists a positive correlation.) The general pattern we can see is that as the median income in a block group (MedInc) increases, the median house value (MedHouseVal) also increases. A quick sketch with a fitted trend line follows below.
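To make that trend easier to see, here is a small sketch (my own addition, not part of the original notebook) that overlays a least-squares trend line on the same scatter plot using NumPy's polyfit.

import numpy as np

# Fit a degree-1 polynomial (a straight line) to MedInc vs MedHouseVal
slope, intercept = np.polyfit(dataset["MedInc"], dataset["MedHouseVal"], deg = 1)

plt.figure(figsize=(12, 8))
plt.scatter(data = dataset, x = "MedInc", y = "MedHouseVal", alpha = 0.3)
plt.plot(dataset["MedInc"], slope * dataset["MedInc"] + intercept, color = "red")
plt.xlabel("MedInc")
plt.ylabel("MedHouseVal")
plt.show()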

 
Geospatial Data (Coordinates) - Longitude and Latitude features.

Now we can also take a look at how the geospatial (coordinate) features (Longitude and Latitude) can help us predict the housing prices.


#PLOT THE COORDINATES TO VISUALIZE A MAP AND HOUSING PRICES AT DIFFERENT LOCATIONS
sns.scatterplot(data = dataset, x = "Longitude", y = "Latitude", size = "MedHouseVal", hue = "MedHouseVal", palette = "viridis", alpha = 0.5)
plt.legend(title = "MedHouseVal", bbox_to_anchor = (1.05, 0.95), loc = "upper left")
plt.title("Median house value depending on\n the spatial location")


Output:

From the map above, we can see that the location of a house also gives us useful insight into what its median price could be: houses in similar price ranges form clusters on the map, and we can use this to estimate a price based on location. A rough sketch of quantifying this clustering idea follows below.
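As a rough, hypothetical illustration of that clustering idea (my own addition, not part of the original notebook), one could group the block groups by their coordinates with KMeans and compare the average MedHouseVal per spatial cluster; the choice of 5 clusters is arbitrary.

from sklearn.cluster import KMeans

#GROUP BLOCK GROUPS INTO SPATIAL CLUSTERS BY THEIR COORDINATES
kmeans = KMeans(n_clusters = 5, n_init = 10, random_state = 42)
dataset["location_cluster"] = kmeans.fit_predict(dataset[["Latitude", "Longitude"]])

# If location carries price information, the cluster averages should differ noticeably
print(dataset.groupby("location_cluster")["MedHouseVal"].mean())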


 

Thank you for your time!! The attached file contains the progress of the project so far.

 




