Housing Price Prediction

Welcome to my very first Machine Learning / Data Science Project.
This post is a continuation of Part 1 - Data Extraction and Part 2 - EDA and Visualization of this project, so please check them out if you haven't already.
I will be sharing the process and updates through these blog posts.
In this blog post I give an overview of the project and focus on a very crucial part of any Machine Learning / Data Science project: Preprocessing!
You can also view this project on Google Colab.
Overview
This Project Notebook covers all the necessary steps to complete the Machine Learning task of predicting housing prices on the California Housing dataset available in scikit-learn. We will perform the following steps to successfully create a model for house price prediction:
1. Data Extraction (See Details in Previous Blog)
Import libraries
Import Dataset from scikit-learn
Understanding the given Description of Data and the problem Statement
Take a look at different Inputs and details available with dataset
Storing the obtained dataset into a Pandas Data frame
2. EDA (Exploratory Data Analysis) and Visualization (See Details in Previous Blog)
Getting a closer Look at obtained Data.
Exploring different Statistics of the Data (Summary and Distributions)
Looking at Correlations (between individual features and between input features and the target)
Geospatial Data / Coordinates - Longitude and Latitude features.
3. Preprocessing (covered in detail in this blog - see below)
4. Modeling
Specifying Evaluation Metric R squared (using Cross-Validation)
Model Training - trying multiple models and hyperparameters:
Linear Regression
Polynomial Regression
Ridge Regression
Decision Trees Regressor
Random Forests Regressor
Gradient Boosted Regressor
eXtreme Gradient Boosting (XGBoost) Regressor
Support Vector Regressor
Model Selection (by comparing evaluation metrics)
Learn Feature Importance and Relations
Prediction
5. Deployment
Exporting the trained model to be used for later predictions (by storing the model object as a byte file - pickling).
3. Preprocessing
Now that we have some insights about data, we need to preprocess them for the modeling part. The main steps are:
Dealing with Duplicate and Null (NaN) values
Dealing with Categorical features (e.g. Dummy coding)
Dealing with Outlier values
Data Normalization (Plots and Tests)
Feature Scaling (Feature Transformation)
Feature Engineering (Feature Design)
Dealing with Duplicate and Null (NaN) values
Many Machine Learning models cannot work with missing (NaN) values, so we need to deal with them before training our models. We can easily check how many values are missing in our dataset with respect to each of the input features.
#CHECK NUMBER OF MISSING VALUES IN EACH OF THE FEATURES (COLUMNS)
dataset.isnull().sum()
Output:

There are no missing values in our dataset. Since this is a practice dataset it contains no missing values, but real-life datasets often do, so we would have to deal with them accordingly.
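The section heading also mentions duplicates, and the check above only covers nulls. Here is a minimal sketch (not from the original notebook) of how one could check for duplicate rows and impute missing values if they did appear; the imputation column and strategy are purely illustrative.
#CHECK FOR EXACT DUPLICATE ROWS AND DROP THEM IF ANY EXIST
print("Duplicate rows :", dataset.duplicated().sum())
dataset = dataset.drop_duplicates()
#IF A FEATURE DID CONTAIN NaNs, MEDIAN IMPUTATION IS ONE COMMON OPTION
#('MedInc' is used here purely as an example column; our data has no missing values)
dataset['MedInc'] = dataset['MedInc'].fillna(dataset['MedInc'].median())
For a pipeline-friendly version of the same idea, scikit-learn's SimpleImputer can be fitted on the train set and reused on the test set.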
Dealing with Categorical features (e.g., Dummy coding)
Categorical features have data stored in a non-numerical or discrete form, and many machine learning algorithms like linear regression cannot directly work with categorical data, so we need to convert them into a numerical type. We can use two methods to do so: Ordinal Encoding and One-Hot Encoding. Now let's check if we have any categorical features in our dataset.
#CHECK DATA TYPES TO IDENTIFY CATEGORICAL FEATURES
dataset.dtypes
Output:

All of our features are of float (real number) type, so we don't need to worry about categorical-to-numerical encoding.
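Just for reference, if the dataset did include a categorical column, here is a minimal sketch of both approaches mentioned above (the column name 'region' is hypothetical and does not exist in our dataset):
#ONE-HOT ENCODING A HYPOTHETICAL CATEGORICAL COLUMN 'region' (NOT PRESENT IN OUR DATASET)
dataset_encoded = pd.get_dummies(dataset, columns = ['region'], drop_first = True)
#ORDINAL ENCODING - SUITABLE WHEN THE CATEGORIES HAVE A NATURAL ORDER
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
dataset['region_encoded'] = encoder.fit_transform(dataset[['region']]).ravel()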
Dealing with Outlier values
There can be some outlier records present in our dataset. These can be valid, actual records, but they do not represent the general data and are more of an exception than the norm. Some Machine Learning models can be quite heavily affected by the presence of these outliers, so we can try to deal with them to improve performance.
NOTE: We should always remove outliers from the data before performing the train-test split and before normalization (standardization).
We can remove outliers from our dataset using the Z-score OR the Interquartile Range (IQR).
Z-score method: an observation is an outlier if its z-score is less than -3 or greater than 3.
IQR method: an observation is an outlier if it is greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR.
We can use the Z-score to remove outliers from normally distributed features, and the IQR to remove outliers from features with a skewed distribution.
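To help decide which method fits which feature, we could also check the skewness numerically; here is a small sketch I am adding (not part of the original notebook) using pandas' skew(), where values near 0 suggest a roughly symmetric distribution and large positive or negative values suggest skew:
#CHECK SKEWNESS OF THE INPUT FEATURES WE ARE CONSIDERING FOR OUTLIER REMOVAL
features_to_check = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
dataset[features_to_check].skew()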
We can also use boxplots to visualize whether our features contain outlier values.
#THE FEATURES WE ARE CONSIDERING FOR OUTLIER REMOVAL
list_of_input_features_for_outliers = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
Detecting Outliers using Visualization (Box Plot)
#VISUALIZING FEATURE DISTRIBUTIONS AND IDENTIFYING OUTLIERS USING BOX-PLOTS
fig, axes = plt.subplots(3, 2, figsize = (10, 10))
sns.boxplot(data = dataset, x = 'MedInc', ax = axes[0, 0])
sns.boxplot(data = dataset, x = 'HouseAge', ax = axes[0, 1])
sns.boxplot(data = dataset, x = 'AveRooms', ax = axes[1, 0])
sns.boxplot(data = dataset, x = 'AveBedrms', ax = axes[1, 1])
sns.boxplot(data = dataset, x = 'Population', ax = axes[2, 0])
sns.boxplot(data = dataset, x = 'AveOccup', ax = axes[2, 1])
Output:

Let's analyze the plots one by one:
'MedInc' - seems to have quite a few outliers.
'HouseAge' - there seem to be no outliers.
'AveRooms' - there seem to be 2 clear outliers, and then there are a few more.
'AveBedrms' - also has 2 clear outliers, and then a few more, similar to 'AveRooms'.
'Population' - we can clearly spot 2 major outliers, and there are also more, less extreme, outliers present.
'AveOccup' - most of the data values seem to be very small, but there are a few clear outliers.
Using IQR
Detecting Outliers using IQR
We can also see a count of how many outlier values there are in each feature. There may be some common records among them, so we will deal with them together for removal.
#FUNCTION TO COUNT THE OUTLIERS IN A PARTICULAR FEATURE
def count_outliers_using_IQR(feature_name):
    # IQR
    Q1 = np.percentile(dataset[feature_name], 25, interpolation = 'midpoint')
    Q3 = np.percentile(dataset[feature_name], 75, interpolation = 'midpoint')
    IQR = Q3 - Q1
    # Above Upper bound
    upper = dataset[feature_name] >= (Q3 + 1.5 * IQR)
    # Below Lower bound
    lower = dataset[feature_name] <= (Q1 - 1.5 * IQR)
    print(feature_name, " : ", len(np.where(upper)[0]) + len(np.where(lower)[0]))

print("Number of Outlier values with respect to features : \n")
for feature in list_of_input_features_for_outliers:
    count_outliers_using_IQR(feature)
Output:

Removing Outliers using IQR
Formula
IQR = Quartile3 - Quartile1
upper limit = Q3 + 1.5 * IQR
lower limit = Q1 - 1.5 * IQR
#FUNCTION TO CALCULATE THE MASK THAT MARKS WHICH ELEMENTS ARE OUTSIDE THE LIMITS - OUTLIERS
def calc_elements_upper_and_lower_than_IQR(feature_name):
    Q1 = np.percentile(dataset[feature_name], 25, interpolation = 'midpoint')
    Q3 = np.percentile(dataset[feature_name], 75, interpolation = 'midpoint')
    IQR = Q3 - Q1
    # Above Upper bound
    upper_element_mask = dataset[feature_name] >= (Q3 + 1.5 * IQR)
    # Below Lower bound
    lower_element_mask = dataset[feature_name] <= (Q1 - 1.5 * IQR)
    return upper_element_mask, lower_element_mask

#CREATING A MASK - MARKING ALL THE ELEMENTS THAT ARE CONSIDERED OUTLIERS i.e. OUT OF LIMITS
#LIST CORRESPONDING TO EACH FEATURE
list_of_masks_for_outlier_removal = []
for feature in list_of_input_features_for_outliers:
    x, y = calc_elements_upper_and_lower_than_IQR(feature)
    list_of_masks_for_outlier_removal.append(x)
    list_of_masks_for_outlier_removal.append(y)
#CREATING A MASK FOR OVERALL DATA RECORDS WITH RESPECT TO ALL THE FEATURES
mask_for_outlier_removal_iqr = np.any(list_of_masks_for_outlier_removal, axis = 0)
# OUTLIER POSITIONS
list_of_records_with_outliers_iqr = np.where(mask_for_outlier_removal_iqr)
#NUMBER OF TOTAL OUTLIERS
len(list_of_records_with_outliers_iqr[0])
Output:
3800
We have 3800 records that are classified as Outliers using IQR Method.
# CREATING A NEW DATASET WITH OUTLIER VALUES REMOVED USING IQR
dataset_clean_iqr = dataset.drop(list_of_records_with_outliers_iqr[0])
dataset_clean_iqr.shape
Output:
(16840, 9)
Now we have a dataset that does not contain outlier values (according to the IQR method); we can try to see if this improves performance and helps the model generalize better.
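One small practical note: drop() keeps the original index labels, so the cleaned DataFrame now has gaps in its index. If we want a clean positional index for later steps, a quick optional sketch:
#RESET THE INDEX OF THE CLEANED DATASET (OPTIONAL)
dataset_clean_iqr = dataset_clean_iqr.reset_index(drop = True)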
Using Z-Score
Formula
Zscore = (data_point - mean) / std. deviation
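Before using scipy, the same formula can also be written directly with pandas; a minimal sketch for a single feature (note that pandas' std() uses ddof = 1 while scipy's zscore defaults to ddof = 0, so the counts can differ very slightly):
#Z-SCORE COMPUTED MANUALLY FOR ONE FEATURE, E.G. 'MedInc'
z_manual = (dataset['MedInc'] - dataset['MedInc'].mean()) / dataset['MedInc'].std()
outlier_mask_manual = np.abs(z_manual) > 3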
# IMPORTING LIBRARY NEEDED FOR Z-SCORE CALCULATION AND DEFINING THRESHOLD VALUE
from scipy import stats
threshold = 3
# DEFINING THE FUNCTION TO CALCULATE THE Z-SCORE VALUE
# Position of the outlier
# where (z > threshold)
def calc_z_score_mask(feature_name):
    z_score = np.abs(stats.zscore(dataset[feature_name]))
    return z_score > threshold
#CREATING A MASK - MARKING ALL THE ELEMENTS THAT ARE CONSIDERED OUTLIERS i.e. OUT OF LIMITS
#LIST CORRESPONDING TO EACH FEATURE
list_of_masks_for_outlier_removal_z = []
for feature in list_of_input_features_for_outliers:
    feature_mask = calc_z_score_mask(feature)
    list_of_masks_for_outlier_removal_z.append(feature_mask)
#CREATING A MASK FOR OVERALL DATA RECORDS WITH RESPECT TO ALL THE FEATURES
mask_for_outlier_removal_z = np.any(list_of_masks_for_outlier_removal_z, axis = 0)
# OUTLIER POSITIONS
list_of_records_with_outliers_z = np.where(mask_for_outlier_removal_z)
#NUMBER OF TOTAL OUTLIERS
len(list_of_records_with_outliers_z[0])
Output:
846
We have 846 records that are classified as Outliers using Z-Score Method.
# CREATING A NEW DATASET WITH OUTLIER VALUES REMOVED USING Z-SCORE
dataset_clean_z = dataset.drop(list_of_records_with_outliers_z[0])
dataset_clean_z.shape
Output:
(19794, 9)
Now we have a dataset that does not contain outlier values w.r.t. the Z-score; we can try to see if this improves performance and helps the model generalize better.
Separating Target and Features
We can separate our input features (X) and target / output feature (y).
y_target = dataset['MedHouseVal']
X_features = dataset.drop(['MedHouseVal'], axis = 1)
Target Feature Normalization
In the EDA phase, we saw that the distribution of our target variable is not a normal distribution but is skewed, which can affect the performance of many learning algorithms. So let's try to transform our target distribution into a normal one. To do this we use a log transformation. We will use a qq-plot to see the effect of the transformation.
#IMPORTING LIBRARIES TO PERFORM NORMALIZATION
from scipy.stats import norm
import scipy.stats as stats
import statsmodels.api as sm

# MedHouseVal BEFORE TRANSFORMATION
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
fig.suptitle("qq-plot & distribution of MedHouseVal", fontsize = 15)
sm.qqplot(y_target, stats.t, distargs = (4,), fit = True, line = "45", ax = ax[0])
sns.distplot(y_target, kde = True, hist = True, fit = norm, ax = ax[1])
plt.show()
Output:

# MedHouseVal AFTER TRANSFORMATION
y_target_log = np.log1p(y_target)
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
fig.suptitle("qq-plot & distribution of MedHouseVal (log-transformed)", fontsize = 15)
sm.qqplot(y_target_log, stats.t, distargs = (4,), fit = True, line = "45", ax = ax[0])
sns.distplot(y_target_log, kde = True, hist = True, fit = norm, ax = ax[1])
plt.show()
Output:

The distribution of our target variable is now closer to a normal distribution than before (except for the part where the price is capped at 500,000 dollars, so more expensive houses are also labeled as 500,000-dollar houses).
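One thing to keep in mind for the modeling phase: if we train on the log-transformed target, the model's predictions will also be on the log scale, so we have to invert the transformation to get prices back. A quick sketch (np.expm1 is the exact inverse of np.log1p; predictions_log here just stands for whatever a trained model would output later):
#CONVERT LOG-SCALE PREDICTIONS BACK TO THE ORIGINAL PRICE SCALE
predictions_log = y_target_log[:5]                      #placeholder values, only to illustrate
predictions_original_scale = np.expm1(predictions_log)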
Splitting Dataset into Train and Test Sets
To evaluate our model and see how well it generalizes to new data, we need to split the data into train and test sets, so we can test how our model performs on data it has never seen before.
# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size = 0.2, random_state = 1)
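Note that the split above uses the original features and the untransformed target. If we instead wanted to experiment with the IQR-cleaned data and the log target from the previous steps, here is a sketch of the corresponding split (my own addition, assuming X and y are re-derived from dataset_clean_iqr):
#ALTERNATIVE SPLIT USING THE IQR-CLEANED DATASET AND THE LOG-TRANSFORMED TARGET
y_clean_log = np.log1p(dataset_clean_iqr['MedHouseVal'])
X_clean = dataset_clean_iqr.drop(['MedHouseVal'], axis = 1)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_clean, y_clean_log, test_size = 0.2, random_state = 1)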
Feature Scaling (Feature Transformation)
Standard Scaling (Standardize the dataset)
We perform scaling on the input features so that all the features have a comparable range and the features with larger values don't become the only prominent features in predicting the value; this also helps learning algorithms (e.g., gradient descent) run faster.
#STANDARDIZE THE DATASET
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#SCALING THE TRAIN DATA - FIT AND TRANSFORM
X_train = scaler.fit_transform(X_train)
#TRANSFORM (SCALE) TEST DATA USING THE SAME SCALER FITTED ON THE TRAIN SET
X_test = scaler.transform(X_test)
We save our standard scaler to be used later to perform the same scaling / transformation on new data, so that our model can make predictions on it.
import pickle
pickle.dump(scaler, open("scaler.pkl", 'wb'))   #here wb = write byte
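Later, when the deployed model receives new data, the saved scaler can be loaded back and applied with the same parameters it learned from the training set. A small sketch (using one existing record purely as an illustration of "new" data):
#LOAD THE SAVED SCALER AND APPLY IT TO NEW, UNSEEN DATA BEFORE PREDICTION
loaded_scaler = pickle.load(open("scaler.pkl", 'rb'))   #here rb = read byte
new_data = X_features.iloc[:1]                          #one existing record, just as an example
new_data_scaled = loaded_scaler.transform(new_data)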
Thank you for your time!! The following is the notebook file with the progress of the project so far.
Did you like my Notebook and my approach??