Lambda School Unit II Portfolio Project

Predicting Home Prices and Building Type

Have you ever wondered if you got a good deal on your house? Have you ever tried to predict the housing market for current or future investments? This is something many people try to do, yet few succeed.

Over the past four weeks I’ve been studying machine learning. Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. My plan was to use supervised learning methods to predict home prices and home types.

This month I took a deep dive into ensemble methods: techniques that train multiple models and then combine them to produce better results than any single model. Two examples are random forests and gradient boosting. Now it was time to take my new knowledge and put my plan into action!

The Data

I acquired a dataset of more than 5,000,000 Russian home listings from Kaggle. The data was parsed from real estate search services in Russia between 2018 and 2021. Its columns are date, time, latitude, longitude, region, building type, object type, level, levels, rooms, area, kitchen area, and price. This large dataset had no missing values but definitely had outliers. The Russian real estate market has been in a growth phase for several years, which means that you can still find properties at very attractive prices with a good chance of increasing in value. Below is the column key.

Data Wrangling

After looking over the columns and the column key, it was time to load the data. Once it was loaded, I ran an automated exploratory data analysis with pandas-profiling.

The profile report gave me some great insights into my data and what needed to be done with it. First, with over 5,000,000 observations, the data had to be trimmed before I could use it with my models. Next, I noticed that I didn’t need the time column, since the date column could serve as my index. I also noticed duplicate rows that needed to be dropped. Knowing what needed to be done, I began to write my code.
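The cleanup steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny frame, the column names, the `wrangle` helper, and the quantile-based price trim are all assumptions, since the post does not say how the outliers were cut.

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle file; column names are assumed.
raw = pd.DataFrame({
    "date": ["2018-02-19", "2018-02-19", "2018-02-27", "2018-03-01", "2018-03-04"],
    "time": ["20:00:21", "20:00:21", "12:04:54", "09:15:00", "10:30:00"],
    "price": [6050000, 6050000, 8650000, 4000000, 900000000],
    "area": [82.6, 82.6, 69.1, 66.0, 38.0],
    "rooms": [3, 3, 2, 2, 1],
})


def wrangle(df, price_quantiles=(0.0, 0.8)):
    """Drop duplicates and the time column, index by date, trim price outliers.

    The quantile cutoffs are hypothetical; pick them from your own EDA.
    """
    df = df.drop_duplicates()
    df = df.drop(columns="time")
    df["date"] = pd.to_datetime(df["date"])
    df = df.set_index("date").sort_index()
    low, high = df["price"].quantile(list(price_quantiles))
    return df[df["price"].between(low, high)]


clean = wrangle(raw)
print(clean.shape)
```

Wrapping the steps in one function keeps the cleaning repeatable if the raw file is reloaded later.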

After fixing all of the problems above, my new dataset had 704,495 rows and 11 columns. Now it was time to build the models and work with the data!

What Type Of Problem Is This?

Now that the data is ready, we must first ask what kind of problem we are trying to solve: classification or regression? Predicting the price of a home is a regression problem, meaning we are trying to predict a continuous quantity. Predicting building type is a classification problem, meaning we are trying to predict a discrete class label. In our case the labels run from 0–5, which makes it a multiclass classification problem.

Establishing a Baseline

After splitting the data frame into a target vector and a feature matrix, we establish a baseline for each problem. The baseline is what we compare our models to. For the building type problem the baseline is an accuracy score. Accuracy is the fraction of predictions the model got right, so we want it as high as possible. For the price prediction problem the baseline is a mean absolute error (MAE). MAE is the average absolute difference between the predicted and actual values, so we want it as low as possible. Lastly, we will look at the R² metric for price prediction. R² is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. Scores usually fall between 0 and 1, and higher is better; a model that does worse than always predicting the mean scores below 0.
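As a rough sketch of how such baselines can be computed, the snippet below uses two common naive predictors: the most frequent class for classification and the mean for regression. The post does not say which baseline predictors were used, so treat these as assumptions; the target arrays are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, r2_score

# Made-up targets for illustration only.
y_type = np.array([1, 1, 2, 3, 1, 2, 1, 0])      # building type labels
y_price = np.array([2.0, 3.0, 4.0, 5.0, 6.0])    # prices, in millions

# Classification baseline (assumed): always predict the most frequent class.
labels, counts = np.unique(y_type, return_counts=True)
majority = labels[np.argmax(counts)]
baseline_acc = accuracy_score(y_type, np.full_like(y_type, majority))

# Regression baseline (assumed): always predict the mean price.
mean_pred = np.full_like(y_price, y_price.mean())
baseline_mae = mean_absolute_error(y_price, mean_pred)
baseline_r2 = r2_score(y_price, mean_pred)

print(baseline_acc, baseline_mae, baseline_r2)
```

Predicting the mean gives an R² of exactly 0 on the data the mean was computed from, which is why 0 is the natural reference point for that metric.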

Bagging and Boosting

The two most popular ensemble methods are bagging and boosting. In this project I use both of them to find the best model for each problem. Bagging is used when the goal is to reduce the variance of a decision tree. The idea is to create several subsets of the training data by sampling randomly with replacement. Each subset is used to train its own decision tree, so we end up with an ensemble of different models. Their predictions are then combined (averaged for regression, majority vote for classification), which is more robust than a single decision tree. An example of a bagging model is a random forest.
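To make the bagging idea concrete, here is a hand-rolled sketch for regression: bootstrap samples, one tree per sample, averaged predictions. The synthetic data and the helper names `fit_bagged_trees` and `predict_bagged` are assumptions for illustration. A real random forest also samples a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only).
data_rng = np.random.default_rng(0)
X = data_rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + data_rng.normal(0, 0.3, size=200)


def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Train one decision tree per bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Sample row indices with replacement (a bootstrap sample).
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return trees


def predict_bagged(trees, X):
    """Average the predictions of all trees in the ensemble."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


trees = fit_bagged_trees(X, y)
preds = predict_bagged(trees, X[:5])
print(preds)
```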

Boosting also builds a collection of predictors, but it trains them sequentially. Early learners fit simple models to the data, and the errors they make guide the learners that follow. Consecutive trees are fit (optionally on a random sample of the data), and at every step the goal is to improve on the ensemble built so far. In the classic weighting scheme (AdaBoost), when an input is misclassified its weight is increased so that the next learner is more likely to classify it correctly; gradient boosting instead fits each new tree to the errors that remain. This process turns weak learners into a better-performing model and is often the more accurate approach. An example of a boosting model is XGBoost, a gradient boosting library.
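The sequential idea can be sketched with a hand-rolled gradient boosting loop for squared error, where each shallow tree is fit to the residuals of the ensemble so far. This is a teaching sketch, not XGBoost itself. The helper names, the synthetic data, and the settings (50 rounds, depth 2, learning rate 0.1) are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a shallow tree to the residuals."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                      # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosted(model, X):
    """Sum the base value and the scaled contribution of every tree."""
    base, learning_rate, trees = model
    pred = np.full(len(X), base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


model = fit_boosted_trees(X, y)
mse_start = float(np.mean((y - np.mean(y)) ** 2))
mse_end = float(np.mean((y - predict_boosted(model, X)) ** 2))
print(mse_start, mse_end)
```

The training error shrinks as rounds are added, which is also why boosting needs validation data or early stopping to avoid fitting noise.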

Making Models

For the task of price prediction I decided to run multiple models and compare their scores. I used linear regression, ridge regression, a random forest regressor, and XGBRegressor. The scores below show that the random forest regressor and XGBRegressor are far superior to linear and ridge regression. We can also see a significant drop in MAE from the original baseline in these two models.
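A comparison like this can be organized as a loop over a dictionary of models. The sketch below uses synthetic data and scikit-learn models only; the split, the hyperparameters, and the `results` layout are assumptions. If xgboost is installed, an `XGBRegressor` could be added to the same dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear target standing in for price (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = 5 * X[:, 0] ** 2 + 3 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same split and score it on the validation set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "mae": mean_absolute_error(y_val, pred),
        "r2": r2_score(y_val, pred),
    }

for name, scores in results.items():
    print(f"{name:>14}  MAE={scores['mae']:.3f}  R2={scores['r2']:.3f}")
```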

I then switched over to my other task, predicting building type, and ran my random forest classifier and XGBClassifier models.

As you can see, the accuracy of both models is significantly higher than our baseline of 0.37. Now you might think the random forest model is the better choice here, but you would be wrong. The random forest does have a higher validation accuracy, but its training accuracy of 0.99 tells us the model is overfitting. Overfitting happens when a model learns the detail and noise in the training data to the extent that it hurts the model’s performance on new data.
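One quick way to spot this kind of overfitting is to score the same model on both the training and validation sets and look at the gap. The sketch below does that on synthetic, deliberately noisy data. The dataset, the `train_val_accuracy` helper, and the depth limit of 3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with 20% label noise (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, flip_y=0.2, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def train_val_accuracy(model):
    """Fit the model and return (training accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Fully grown trees can memorize noisy labels; shallow trees cannot.
deep_train, deep_val = train_val_accuracy(
    RandomForestClassifier(n_estimators=50, random_state=0)
)
shallow_train, shallow_val = train_val_accuracy(
    RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
)

print(f"deep:    train={deep_train:.2f}  val={deep_val:.2f}")
print(f"shallow: train={shallow_train:.2f}  val={shallow_val:.2f}")
```

A large gap between the two numbers is the warning sign; limiting tree depth is one of several ways to narrow it.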

Model Tuning

With models up and running for each problem, it was time to take the best ones and tune them. Tuning is the process of maximizing a model’s performance without overfitting or introducing too much variance. This is done by choosing candidate hyperparameter values and testing them. For the price prediction problem, the random forest and XGB models worked best, but the random forest was overfitting. I decided to tune the XGB model to achieve the best metrics possible.

After running a randomized search with cross-validation, I found the best hyperparameters to use. With those hyperparameters plugged into the XGB model, we achieved much better evaluation metrics than our previous scores and baseline.
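A randomized search of this kind might look like the sketch below, which uses scikit-learn's `RandomizedSearchCV`. To keep it free of extra dependencies it tunes scikit-learn's `GradientBoostingRegressor` as a stand-in for the XGB model. The search space, the number of iterations, and the synthetic data are assumptions, not the values used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hypothetical search space: distributions are sampled, lists are picked from.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 5),
    "learning_rate": [0.03, 0.1, 0.3],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBRegressor
    param_distributions=param_distributions,
    n_iter=5,                                   # number of sampled combinations
    cv=3,                                       # 3-fold cross-validation
    scoring="neg_mean_absolute_error",          # higher (closer to 0) is better
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print("cross-validated MAE:", -search.best_score_)
```

Unlike a grid search, a randomized search tries only `n_iter` sampled combinations, which keeps the cost manageable on a dataset this size.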

Now back to the building type problem. As stated before, the random forest model was overfitting, so I decided to tune the XGB model for this multiclass classification problem.

Just like before, I used a randomized search with cross-validation to find the best hyperparameters and plugged them into the model. Again we achieved our best score yet.

Permutation Importance

Now that we have tuned models running for each problem, let’s look at the permutation importance of each. Permutation feature importance is defined as the decrease in a model’s score when a single feature’s values are randomly shuffled. Shuffling breaks the relationship between the feature and the target, so the drop in score indicates how much the model depends on that feature. On the left is the permutation feature importance for price prediction, and on the right is the one for building type.
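The shuffling procedure described above is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below runs it on synthetic data in which only the first feature drives the target. The feature names and the random forest model are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column influences the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 4 * X[:, 0] + rng.normal(0, 0.1, size=400)
feature_names = ["area", "noise_a", "noise_b"]  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature 5 times on held-out data and record the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)

ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"{feature_names[i]:>8}: {result.importances_mean[i]:.3f}")
```

Computing the importances on validation data rather than training data shows which features matter for generalization, not just for memorization.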

Most Important Features

Below is a list of the most important features for each problem. On the left are the most important features for price prediction, and on the right are those for building type prediction.

Classification Report

As the name implies, a classification report is for classification problems only. Below is the report for our building type problem, where 0–5 represent the different building types. Precision tells us what proportion of positive identifications were actually correct, while recall tells us what proportion of actual positives were identified correctly. The F1 score combines precision and recall for a specific positive class: it is their harmonic mean, reaching its best value at 1 and its worst at 0.
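As a small worked example of these definitions, the sketch below builds a report with scikit-learn's `classification_report` on ten made-up labels and checks the F1 score for one class by hand. The labels are invented for illustration and are not the project's results.

```python
from sklearn.metrics import classification_report

# Ten made-up building type labels and predictions.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1]

# output_dict=True returns nested dicts keyed by the label as a string.
report = classification_report(y_true, y_pred, output_dict=True)

# Class 1: 3 of the 4 predicted 1s are right, and 3 of the 4 actual 1s are found.
precision_1 = report["1"]["precision"]
recall_1 = report["1"]["recall"]

# F1 is the harmonic mean of precision and recall.
f1_by_hand = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(classification_report(y_true, y_pred))
print("F1 for class 1, by hand:", f1_by_hand)
```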

Confusion Matrix

Finally, we have come to our last metric, the confusion matrix. This is another metric that only works with classification problems, so it represents our building type problem. From this confusion matrix we can see that our model is good at predicting building types and that the majority of the buildings are types 1, 2, and 3.
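For readers new to the confusion matrix, here is a minimal sketch with scikit-learn's `confusion_matrix` on ten made-up labels (not the project's data). Rows are the true classes, columns are the predicted classes, and the diagonal counts the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Ten made-up building type labels and predictions.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1]

# Rows = true class, columns = predicted class, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = np.trace(cm) / cm.sum()

# Row sums show how many examples of each true class there are.
class_counts = cm.sum(axis=1)
print(accuracy, class_counts)
```

Reading along a row shows which classes a given building type gets mistaken for, which a single accuracy number cannot show.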

Conclusion

In conclusion, the last four weeks were awesome. I learned a ton about machine learning and predictive models. While the end product might look good, I ran into a ton of small problems throughout this project. These problems helped me learn and gain valuable insights into the machine learning world. Learning and having the ability to predict house prices and building type is the closest thing I have to a superpower. This project has done nothing but excite me about machine learning and AI. I hope to carry this enthusiasm into the next few months and continue to feed my curiosity!