Housing Price Prediction

Welcome to my very first Machine Learning / Data Science Project.
This post is a continuation of Part 1 - Data Extraction and Part 2 - EDA and Visualization of this project, so please check them out if you haven't already.
I will be sharing the process and updates through these blog posts.
In this blog post I give an overview of the project and focus on a very crucial part of any Machine Learning / Data Science project: Preprocessing!
You can also view this project on Google Colab.
Overview
This Project Notebook covers all the necessary steps to complete the Machine Learning task of predicting housing prices on the California Housing dataset available in scikit-learn. We will perform the following steps to successfully create a model for house price prediction:
1. Data Extraction (See Details in Previous Blog)
Import libraries
Import Dataset from scikit-learn
Understanding the given Description of Data and the problem Statement
Take a look at different Inputs and details available with dataset
Storing the obtained dataset into a Pandas Data frame
2. EDA (Exploratory Data Analysis) and Visualization (See Details in Previous Blog)
Getting a closer Look at obtained Data.
Exploring different Statistics of the Data (Summary and Distributions)
Looking at Correlations (between individual features and between input features and the target)
Geospatial Data / Coordinates - Longitude and Latitude features.
3. Preprocessing (covered in detail in this blog - see below)
4. Modeling
Specifying Evaluation Metric R squared (using Cross-Validation)
Model Training - trying multiple models and hyperparameters:
Linear Regression
Polynomial Regression
Ridge Regression
Decision Trees Regressor
Random Forests Regressor
Gradient Boosted Regressor
eXtreme Gradient Boosting (XGBoost) Regressor
Support Vector Regressor
Model Selection (by comparing evaluation metrics)
Learn Feature Importance and Relations
Prediction
5. Deployment
Exporting the trained model to be used for later predictions (by storing the model object as a byte file - pickling).
3. Preprocessing
Now that we have some insights about data, we need to preprocess them for the modeling part. The main steps are:
Dealing with Duplicate and Null (NaN) values
Dealing with Categorical features (e.g. Dummy coding)
Dealing with Outlier values
Data Normalization (Plots and Tests)
Feature Scaling (Feature Transformation)
Feature Engineering (Feature Design)
Dealing with Duplicate and Null (NaN) values
Many Machine Learning models cannot work with missing (NaN) values, so we need to deal with them before training our models. We can easily check how many values are missing in our dataset with respect to each of the input features.
#CHECK NUMBER OF MISSING VALUES IN EACH OF THE FEATURES (COLUMNS)
dataset.isnull().sum()
Output:

There are no missing values in our dataset. Since this is a practice dataset it contains no missing values, but real-life datasets often do, so we would have to deal with them accordingly.
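The section heading also mentions duplicates, and the check above only covers nulls. Here is a minimal sketch (not from the original notebook) of how one could check for duplicate rows and impute missing values if they did appear; the imputation column and strategy are purely illustrative.
#CHECK FOR EXACT DUPLICATE ROWS AND DROP THEM IF ANY EXIST
print("Duplicate rows :", dataset.duplicated().sum())
dataset = dataset.drop_duplicates()
#IF A FEATURE DID CONTAIN NaNs, MEDIAN IMPUTATION IS ONE COMMON OPTION
#('MedInc' is used here purely as an example column; our data has no missing values)
dataset['MedInc'] = dataset['MedInc'].fillna(dataset['MedInc'].median())
For a pipeline-friendly version of the same idea, scikit-learn's SimpleImputer can be fitted on the train set and reused on the test set.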
Dealing with Categorical features (e.g., Dummy coding)
Categorical features have data stored in a non-numerical or discrete form, and many machine learning algorithms like linear regression cannot directly work with categorical data, so we need to convert them into a numerical type. We can use two methods to do so: Ordinal Encoding and One-Hot Encoding. Now let's check if we have any categorical features in our dataset.
#CHECK DATA TYPES TO IDENTIFY CATEGORICAL FEATURES
dataset.dtypes
Output:

All of our features are of float (real number) type, so we don't need to worry about categorical-to-numerical encoding.
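Just for reference, if the dataset did include a categorical column, here is a minimal sketch of both approaches mentioned above (the column name 'region' is hypothetical and does not exist in our dataset):
#ONE-HOT ENCODING A HYPOTHETICAL CATEGORICAL COLUMN 'region' (NOT PRESENT IN OUR DATASET)
dataset_encoded = pd.get_dummies(dataset, columns = ['region'], drop_first = True)
#ORDINAL ENCODING - SUITABLE WHEN THE CATEGORIES HAVE A NATURAL ORDER
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
dataset['region_encoded'] = encoder.fit_transform(dataset[['region']]).ravel()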
Dealing with Outlier values
There can be some outlier records present in our dataset. These can be valid, actual records, but they do not represent the general data and are more of an exception than the norm. Some Machine Learning models can be quite heavily affected by the presence of these outliers, so we can try to deal with them to improve performance.
NOTE: We should always remove outliers from the data before performing the train-test split and before normalization (standardization).
We can remove outliers from our dataset using the Z-score OR the Interquartile Range (IQR).
Z-score method: an observation is an outlier if its z-score is less than -3 or greater than 3.
IQR method: an observation is an outlier if it is greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR.
We can use the Z-score to remove outliers from normally distributed features, and the IQR to remove outliers from features with a skewed distribution.
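To help decide which method fits which feature, we could also check the skewness numerically; here is a small sketch I am adding (not part of the original notebook) using pandas' skew(), where values near 0 suggest a roughly symmetric distribution and large positive or negative values suggest skew:
#CHECK SKEWNESS OF THE INPUT FEATURES WE ARE CONSIDERING FOR OUTLIER REMOVAL
features_to_check = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
dataset[features_to_check].skew()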
We can also use boxplots to visualize whether our features contain outlier values.
#THE FEATURES WE ARE CONSIDERING FOR OUTLIER REMOVAL
list_of_input_features_for_outliers = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
Detecting Outliers using Visualization (Box Plot)
#VISUALIZING FEATURE DISTRIBUTIONS AND IDENTIFYING OUTLIERS USING BOX-PLOTS
fig, axes = plt.subplots(3, 2, figsize = (10, 10))
sns.boxplot(data = dataset, x = 'MedInc', ax = axes[0, 0])
sns.boxplot(data = dataset, x = 'HouseAge', ax = axes[0, 1])
sns.boxplot(data = dataset, x = 'AveRooms', ax = axes[1, 0])
sns.boxplot(data = dataset, x = 'AveBedrms', ax = axes[1, 1])
sns.boxplot(data = dataset, x = 'Population', ax = axes[2, 0])
sns.boxplot(data = dataset, x = 'AveOccup', ax = axes[2, 1])
Output:

Let's analyze the plots one by one:
'MedInc' - seems to have quite a few outliers.
'HouseAge' - there seem to be no outliers.
'AveRooms' - there seem to be 2 clear outliers, and then there are a few more.
'AveBedrms' - also has 2 clear outliers, and then a few more, similar to 'AveRooms'.
'Population' - we can clearly spot 2 major outliers, and there are also more, less extreme, outliers present.
'AveOccup' - most of the data values seem to be very small, but there are a few clear outliers.
Using IQR
Detecting Outliers using IQR
We can also see a count of how many outlier values there are in each feature. There may be some common records among them, so we will deal with them together for removal.
#FUNCTION TO COUNT THE OUTLIERS IN A PARTICULAR FEATURE
def count_outliers_using_IQR(feature_name):
    # IQR
    Q1 = np.percentile(dataset[feature_name], 25, interpolation = 'midpoint')
    Q3 = np.percentile(dataset[feature_name], 75, interpolation = 'midpoint')
    IQR = Q3 - Q1
    # Above Upper bound
    upper = dataset[feature_name] >= (Q3 + 1.5 * IQR)
    # Below Lower bound
    lower = dataset[feature_name] <= (Q1 - 1.5 * IQR)
    print(feature_name, " : ", len(np.where(upper)[0]) + len(np.where(lower)[0]))

print("Number of Outlier values with respect to features : \n")
for feature in list_of_input_features_for_outliers:
    count_outliers_using_IQR(feature)
Output:

Removing Outliers using IQR
Formula
IQR = Quartile3 - Quartile1
upper limit = Q3 + 1.5 * IQR
lower limit = Q1 - 1.5 * IQR
#FUNCTION TO CALCULATE THE MASK THAT MARKS WHICH ELEMENTS ARE OUTSIDE THE LIMITS - OUTLIERS
def calc_elements_upper_and_lower_than_IQR(feature_name):
    Q1 = np.percentile(dataset[feature_name], 25, interpolation = 'midpoint')
    Q3 = np.percentile(dataset[feature_name], 75, interpolation = 'midpoint')
    IQR = Q3 - Q1
    # Above Upper bound
    upper_element_mask = dataset[feature_name] >= (Q3 + 1.5 * IQR)
    # Below Lower bound
    lower_element_mask = dataset[feature_name] <= (Q1 - 1.5 * IQR)
    return upper_element_mask, lower_element_mask

#CREATING A MASK - MARKING ALL THE ELEMENTS THAT ARE CONSIDERED OUTLIERS i.e. OUT OF LIMITS
#LIST CORRESPONDING TO EACH FEATURE
list_of_masks_for_outlier_removal = []
for feature in list_of_input_features_for_outliers:
    x, y = calc_elements_upper_and_lower_than_IQR(feature)
    list_of_masks_for_outlier_removal.append(x)
    list_of_masks_for_outlier_removal.append(y)
#CREATING A MASK FOR OVERALL DATA RECORDS WITH RESPECT TO ALL THE FEATURES
mask_for_outlier_removal_iqr = np.any(list_of_masks_for_outlier_removal, axis = 0)
# OUTLIER POSITIONS
list_of_records_with_outliers_iqr = np.where(mask_for_outlier_removal_iqr)
#NUMBER OF TOTAL OUTLIERS
len(list_of_records_with_outliers_iqr[0])
Output:
3800
We have 3800 records that are classified as Outliers using IQR Method.
# CREATING A NEW DATASET WITH OUTLIER VALUES REMOVED USING IQR
dataset_clean_iqr = dataset.drop(list_of_records_with_outliers_iqr[0])
dataset_clean_iqr.shape
Output:
(16840, 9)
Now we have a dataset that does not contain outlier values (according to the IQR method); we can try to see if this improves performance and helps the model generalize better.
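One small practical note: drop() keeps the original index labels, so the cleaned DataFrame now has gaps in its index. If we want a clean positional index for later steps, a quick optional sketch:
#RESET THE INDEX OF THE CLEANED DATASET (OPTIONAL)
dataset_clean_iqr = dataset_clean_iqr.reset_index(drop = True)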
Using Z-Score
Formula
Zscore = (data_point - mean) / std. deviation
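Before using scipy, the same formula can also be written directly with pandas; a minimal sketch for a single feature (note that pandas' std() uses ddof = 1 while scipy's zscore defaults to ddof = 0, so the counts can differ very slightly):
#Z-SCORE COMPUTED MANUALLY FOR ONE FEATURE, E.G. 'MedInc'
z_manual = (dataset['MedInc'] - dataset['MedInc'].mean()) / dataset['MedInc'].std()
outlier_mask_manual = np.abs(z_manual) > 3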
# IMPORTING LIBRARY NEEDED FOR Z-SCORE CALCULATION AND DEFINING THRESHOLD VALUE
from scipy import stats
threshold = 3
# DEFINING THE FUNCTION TO CALCULATE THE Z-SCORE VALUE
# Position of the outlier
# where (z > threshold)
def calc_z_score_mask(feature_name):
    z_score = np.abs(stats.zscore(dataset[feature_name]))
    return z_score > threshold
#CREATING A MASK - MARKING ALL THE ELEMENTS THAT ARE CONSIDERED OUTLIERS i.e. OUT OF LIMITS
#LIST CORRESPONDING TO EACH FEATURE
list_of_masks_for_outlier_removal_z = []
for feature in list_of_input_features_for_outliers:
    feature_mask = calc_z_score_mask(feature)
    list_of_masks_for_outlier_removal_z.append(feature_mask)
#CREATING A MASK FOR OVERALL DATA RECORDS WITH RESPECT TO ALL THE FEATURES
mask_for_outlier_removal_z = np.any(list_of_masks_for_outlier_removal_z, axis = 0)
# OUTLIER POSITIONS
list_of_records_with_outliers_z = np.where(mask_for_outlier_removal_z)
#NUMBER OF TOTAL OUTLIERS
len(list_of_records_with_outliers_z[0])
Output:
846
We have 846 records that are classified as Outliers using Z-Score Method.
# CREATING A NEW DATASET WITH OUTLIER VALUES REMOVED USING Z-SCORE
dataset_clean_z = dataset.drop(list_of_records_with_outliers_z[0])
dataset_clean_z.shape
Output:
(19794, 9)
Now we have a dataset that does not contain outlier values w.r.t. the Z-score; we can try to see if this improves performance and helps the model generalize better.
Separating Target and Features
We can separate our input features (X) and target / output feature (y).
y_target = dataset['MedHouseVal']
X_features = dataset.drop(['MedHouseVal'], axis = 1)
Target Feature Normalization
In the EDA phase, we saw that the distribution of our target variable is not a normal distribution but is skewed, which can affect the performance of many learning algorithms. So let's try to transform our target distribution into a normal one. To do this we use a log transformation. We will use a qq-plot to see the effect of the transformation.
#IMPORTING LIBRARIES TO PERFORM NORMALIZATION
from scipy.stats import norm
import scipy.stats as stats
import statsmodels.api as sm

# MedHouseVal BEFORE TRANSFORMATION
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
fig.suptitle("qq-plot & distribution of MedHouseVal", fontsize = 15)
sm.qqplot(y_target, stats.t, distargs = (4,), fit = True, line = "45", ax = ax[0])
sns.distplot(y_target, kde = True, hist = True, fit = norm, ax = ax[1])
plt.show()
Output:

# MedHouseVal AFTER TRANSFORMATION
y_target_log = np.log1p(y_target)
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
fig.suptitle("qq-plot & distribution of MedHouseVal (log-transformed)", fontsize = 15)
sm.qqplot(y_target_log, stats.t, distargs = (4,), fit = True, line = "45", ax = ax[0])
sns.distplot(y_target_log, kde = True, hist = True, fit = norm, ax = ax[1])
plt.show()
Output:

The distribution of our target variable is now closer to a normal distribution than before (except for the part where the price is capped at 500,000 dollars, so more expensive houses are also labeled as 500,000-dollar houses).
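One thing to keep in mind for the modeling phase: if we train on the log-transformed target, the model's predictions will also be on the log scale, so we have to invert the transformation to get prices back. A quick sketch (np.expm1 is the exact inverse of np.log1p; predictions_log here just stands for whatever a trained model would output later):
#CONVERT LOG-SCALE PREDICTIONS BACK TO THE ORIGINAL PRICE SCALE
predictions_log = y_target_log[:5]                      #placeholder values, only to illustrate
predictions_original_scale = np.expm1(predictions_log)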
Splitting Dataset into Train and Test Sets
To evaluate our model and see how well it generalizes to new data, we need to split the data into train and test sets, so we can test how our model performs on data it has never seen before.
# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size = 0.2, random_state = 1)
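Note that the split above uses the original features and the untransformed target. If we instead wanted to experiment with the IQR-cleaned data and the log target from the previous steps, here is a sketch of the corresponding split (my own addition, assuming X and y are re-derived from dataset_clean_iqr):
#ALTERNATIVE SPLIT USING THE IQR-CLEANED DATASET AND THE LOG-TRANSFORMED TARGET
y_clean_log = np.log1p(dataset_clean_iqr['MedHouseVal'])
X_clean = dataset_clean_iqr.drop(['MedHouseVal'], axis = 1)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_clean, y_clean_log, test_size = 0.2, random_state = 1)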
Feature Scaling (Feature Transformation)
Standard Scaling (Standardize the dataset)
We perform scaling on the input features so that all the features have a comparable range and the features with larger values don't become the only prominent features in predicting the value; this also helps learning algorithms (e.g., gradient descent) run faster.
#STANDARDIZE THE DATASET
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#SCALING THE TRAIN DATA - FIT AND TRANSFORM
X_train = scaler.fit_transform(X_train)
#TRANSFORM (SCALE) TEST DATA USING THE SAME SCALER FITTED ON THE TRAIN SET
X_test = scaler.transform(X_test)
We save our standard scaler to be used later to perform the same scaling / transformation on new data, so that our model can make predictions on it.
import pickle
pickle.dump(scaler, open("scaler.pkl", 'wb'))   #here wb = write byte
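Later, when the deployed model receives new data, the saved scaler can be loaded back and applied with the same parameters it learned from the training set. A small sketch (using one existing record purely as an illustration of "new" data):
#LOAD THE SAVED SCALER AND APPLY IT TO NEW, UNSEEN DATA BEFORE PREDICTION
loaded_scaler = pickle.load(open("scaler.pkl", 'rb'))   #here rb = read byte
new_data = X_features.iloc[:1]                          #one existing record, just as an example
new_data_scaled = loaded_scaler.transform(new_data)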
Thank you for your time!! The following is the notebook file with the progress of the project so far.
Did you like my Notebook and my approach??