Exploring Condominium Rental Market with Web Scraping and EDA
Glean insights into the Singapore property market through web scraping and EDA in Python
Photo by Mike Enerio on Unsplash
Introduction and Motivation
There has been plenty of analysis on public housing sales (especially the over-analyzed Boston housing dataset). Even in the Singapore context, there are numerous notebooks and datasets that involve HDB public housing.
Hence, I was keen to examine a segment of the property market that is less explored. Instead of public housing, I looked at private properties (specifically condominiums), and instead of sales, I delved into rentals.
In this article, I share how I acquired data with web scraping and performed exploratory data analysis to glean insights into the Singapore condo rental market. Let’s go!
Note: This is Part 1 of a two-part series on the analysis of the Singapore condo rental market. Do also check out Part 2 (Predicting Rental Prices with Ensemble Regressors).
Data Acquisition through Web Scraping
Since there is no openly available dataset for the condo market, we will need to start with data acquisition. This is done through web scraping, which is the process of extracting content from a website. The site from which the information was scraped is 99.co (see Disclaimer at bottom of article), and the tools used were:
- Selenium: A powerful open-source tool for automating web browsers
- WebDriver for Chrome: A driver that provides a platform- and language-neutral wire protocol, allowing out-of-process programs to remotely instruct the behavior of web browsers
The source code can be found in the Jupyter notebook, so I shall not go into the details. Instead, I will share some tips on successful web scraping:
- It will be beneficial if you first gain familiarity with HTML/CSS, because web scraping is dependent on effective navigation of website structures. W3Schools offers great tutorials on this.
- To explore the website structure and locate the items you want to scrape, press F12 (or right-click on the webpage and click ‘Inspect’) and hover over the HTML tags to identify the data nested within them.
- As web scraping involves task automation, you must have a good idea of the sequence of site navigation (e.g. open link -> scroll to bottom -> click Next button). Getting this right usually takes multiple rounds of trial and error.
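The navigation sequence in the last tip (open link -> scroll to bottom -> click Next button) can be sketched roughly as follows. This is a minimal illustration and not the code from the notebook: the function name `collect_listing_links` and the CSS selectors passed to it are hypothetical, and the real selectors have to be found by inspecting the site. It relies on the fact that Selenium's `By.CSS_SELECTOR` locator is the plain string "css selector", so the function works with any WebDriver instance passed in.

```python
import time

# Selenium's By.CSS_SELECTOR is the plain string "css selector".
CSS = "css selector"


def collect_listing_links(driver, start_url, link_css, next_css,
                          max_pages=3, pause=1.0):
    """Open link -> scroll to bottom -> collect links -> click Next, repeated.

    link_css and next_css are hypothetical CSS selectors for the listing
    links and the Next button; they must be adapted to the actual site.
    """
    links = []
    driver.get(start_url)
    for _ in range(max_pages):
        # Scroll to the bottom so that lazily loaded listings are rendered.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait; explicit waits are more robust

        for anchor in driver.find_elements(CSS, link_css):
            href = anchor.get_attribute("href")
            if href and href not in links:
                links.append(href)

        # Stop when there is no Next button left to click.
        next_buttons = driver.find_elements(CSS, next_css)
        if not next_buttons:
            break
        next_buttons[0].click()
    return links
```

With Selenium installed, a Chrome driver can be created with `webdriver.Chrome()` (after `from selenium import webdriver`) and passed in as `driver`. The fixed `pause` is the simplest way to wait for content to load, which is one reason trial and error is needed to find values that work.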
The data acquisition took place in mid-December 2020, and of the ~11,000 rental listings, 7,317 listings were successfully extracted and stored as Python dictionaries. The attributes extracted are shown below:
Here is an example of a full listing:
Many datasets found online for practice (e.g. Kaggle) are more or less ready for you to run analyses on them. In contrast, because this project started with raw data acquisition, a significant amount of time was spent pre-processing the data so that it is correctly organized.
Here are some of the pre-processing steps taken (see the Data Pre-processing and EDA notebook for details):
- Merged beds and bed columns into one, and likewise for baths and bath
- Replaced NaN values in district with values from online references
- Removed obviously erroneous outlier values in rental and unit size (sqft)
- Grouped tenure values into categories of Freehold and Non-Freehold
- Converted data types of numerical features from objects to integers, e.g. sqft, rental, travel_time_orchard etc.
- Omitted features with >50% missing values e.g. floor of unit, year built etc.
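To make a few of these steps concrete, here is a minimal sketch in plain Python that operates on listing dictionaries. It is an illustration only, not the notebook's code: the raw value formats (e.g. "$4,500", "1,023 sqft"), the tenure strings, and the helper names `to_int`, `clean_listing` and `drop_sparse_fields` are all assumptions. The district fill-in and outlier removal steps are left out because they depend on external references and judgment calls.

```python
import re


def to_int(value):
    """Convert values such as '$4,500' or '1,023 sqft' to int (None if no digits).

    Assumes whole numbers; a decimal point would be stripped like any other
    non-digit character.
    """
    if value is None:
        return None
    if isinstance(value, int):
        return value
    digits = re.sub(r"[^\d]", "", str(value))
    return int(digits) if digits else None


def clean_listing(raw, numeric_fields=("sqft", "rental", "travel_time_orchard")):
    """Return a cleaned copy of one scraped listing dictionary."""
    row = dict(raw)
    # Merge duplicate column pairs: prefer 'beds'/'baths', fall back to 'bed'/'bath'.
    for keep, dup in (("beds", "bed"), ("baths", "bath")):
        if row.get(keep) is None:
            row[keep] = row.get(dup)
        row.pop(dup, None)
    # Group tenure into Freehold / Non-Freehold (missing values stay missing).
    tenure = row.get("tenure")
    if tenure is not None:
        row["tenure"] = "Freehold" if "freehold" in tenure.lower() else "Non-Freehold"
    # Convert numerical features from strings to integers.
    for field in numeric_fields:
        row[field] = to_int(row.get(field))
    return row


def drop_sparse_fields(rows, max_missing=0.5):
    """Drop fields that are missing (None or absent) in more than max_missing of rows."""
    fields = {key for row in rows for key in row}
    sparse = {f for f in fields
              if sum(row.get(f) is None for row in rows) / len(rows) > max_missing}
    return [{k: v for k, v in row.items() if k not in sparse} for row in rows]


# Hypothetical raw listings, for illustration only.
raw_listings = [
    {"beds": "2", "baths": "2", "sqft": "1,023 sqft", "rental": "$4,500",
     "tenure": "99-year leasehold", "travel_time_orchard": "25", "floor": None},
    {"bed": "3", "bath": "2", "sqft": "1,400 sqft", "rental": "$6,200",
     "tenure": "Freehold", "travel_time_orchard": "12", "floor": None},
    {"beds": "1", "baths": "1", "sqft": "550 sqft", "rental": "$2,900",
     "tenure": "Freehold", "travel_time_orchard": "18", "floor": "12"},
]
cleaned = drop_sparse_fields([clean_listing(r) for r in raw_listings])
print(cleaned[1])
```

In this toy data, `floor` is missing in two of three listings, so it is dropped by the 50% rule, mirroring what happened to floor of unit and year built in the real dataset.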
One feature worth highlighting is the rental price (rental). When we plot its distribution (histogram with kernel density estimation) and probability plot, we see an obvious right skew.
Running the pandas .skew() method (which measures distribution asymmetry) returns a high value of 3.497, quantitatively supporting the observation of asymmetry. To manage this, we can apply a log transformation, and this is what we get:
The log operation has transformed the skewed data to conform approximately to normality, and the skew value is now much lower at 0.696.
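The effect of the log transformation on skewness can be reproduced on a small scale with the sketch below. The rental values are made up for illustration, and `sample_skewness` is a hand-written helper implementing the adjusted Fisher-Pearson coefficient, a common bias-corrected definition. It should agree with what pandas reports, but treat that as an assumption rather than a guarantee. The natural log is used here; the exact transform in the notebook may differ.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


# Hypothetical monthly rentals (SG$) with a long right tail.
rentals = [2800, 3000, 3200, 3500, 3800, 4200, 5000, 6500, 9000, 25000]
log_rentals = [math.log(r) for r in rentals]

print("skew before log:", round(sample_skewness(rentals), 3))
print("skew after log: ", round(sample_skewness(log_rentals), 3))
```

The log compresses large values more than small ones, which pulls in the long right tail and brings the skewness closer to zero.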
Insights from Exploratory Data Analysis (EDA)
- Singapore is divided into 28 districts, and each has its own characteristics and living experience.
- A significant number of rental listings are located in the core central region of Singapore. The top 4 districts by number of listings (i.e. D09, D10, D01, D11) are all part of the central region, and together they make up almost half (48.2%) of all listings.
- As for mean rental price for each district, District 10 (a key part of the Singapore central region) takes the cake. Besides land scarcity, being situated close to or within the city’s center and financial district is one of the reasons why private residential property in this zone is so costly. District 4 is also high up on the list as it houses a range of luxurious rental options at Sentosa, a renowned island resort off the southern coast.
- Given that there is a skew in the rental price, we can also look at median price instead. Even so, the trend remained the same as above.
- To account for different unit sizes, we need to explore price per square foot (psf). By doing so, we see a slightly different trend:
- What we see now is that District 6 (Clarke Quay, City Hall) is the most costly based on psf (previously ranked 10th in mean rental), while the top two districts for mean rental (D04 and D10) have dropped in the rankings. This means that the high mean rental price of units in D04 and D10 seen earlier was driven mostly by their larger unit sizes.
- In terms of number of units actively listed, the top 3 condominiums are The Sail @ Marina Bay (169 listings accounting for 2.3% of total active listings), Artra (119 listings) and Marina One Residences (107 listings).
- Based on average monthly rental price (SG$), the top 3 condominiums are The Nassim ($49,847), Le Nouvel Ardmore ($28,800) and Seven Palms Sentosa Cove ($26,500).
- The box plots reveal differences in average rental price between the 3 lease types (Flexible, ≥ 24 months and < 24 months). This is substantiated quantitatively by running a one-way ANOVA test, which returns an F-statistic of 123.019 (p-value < 0.001). This indicates a statistically significant difference between the means of the 3 groups.
- Performing Tukey’s test for post-hoc analysis shows that all pairwise combinations of the 3 groups differ significantly. All this suggests that a unit with a higher monthly rental is associated with a longer lease contract.
- As we might expect, the larger the unit size (based on sqft), the higher the rental price. Also, since more bedrooms (beds) directly lead to a larger unit size, we see the same trend for beds as well, i.e. sqft and beds are highly correlated.
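For readers curious about what the one-way ANOVA F-statistic for the lease types measures, the sketch below computes it by hand: the variance between group means divided by the variance within groups. The helper name `one_way_anova_f` and the rental figures are hypothetical. In practice a library routine such as `scipy.stats.f_oneway` would be used, since it also returns the p-value.

```python
def one_way_anova_f(*groups):
    """Return the one-way ANOVA F-statistic for two or more groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their own group mean
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in g)

    ms_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)   # N - k degrees of freedom
    return ms_between / ms_within


# Hypothetical monthly rentals (SG$) for each lease type.
flexible = [3200, 3500, 3900, 4100]
under_24_months = [3000, 3300, 3600, 3800]
at_least_24_months = [4500, 5200, 5800, 6100]

print(one_way_anova_f(flexible, under_24_months, at_least_24_months))
```

A large F value means the group means are far apart relative to the spread inside each group. It does not say which groups differ, which is why a post-hoc test such as Tukey's is run afterwards.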
Travel Time (by Public Transport) to Key Locations
- Regression plots show that units nearer to the city center (i.e. Raffles Place and Orchard Road) have higher rental prices. The trend for Changi Airport (i.e. units further away from Changi Airport have higher rental price) is likely driven by city center proximity, since the airport is relatively far away from the city center.
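The trend line in a regression plot of this kind is typically an ordinary least-squares fit. As a rough sketch with made-up numbers (the helper name `fit_line` and the data are hypothetical), the slope can be computed directly. A negative slope means rental falls as travel time to the city center increases.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: travel time to Orchard Road (minutes) vs monthly rental (SG$).
travel_time_orchard = [10, 15, 25, 30, 45, 60]
rental = [5800, 5200, 4300, 4100, 3400, 3000]

slope, intercept = fit_line(travel_time_orchard, rental)
print(f"rental changes by about {slope:.1f} SG$ per extra minute of travel")
```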
Number of Amenities
- My initial guess that the number of amenities influences rental price was disproved upon seeing the flat trend from the regression plot:
- This is possibly because it is the types of amenities that matter, not the quantity. Furthermore, it is likely that not all amenities were accurately captured and posted on the site listings. This is evidenced by the highly varied prevalence of the different types of amenities.
- The following table shows the top 10 amenities based on percentage prevalence in condominium listings:
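Prevalence percentages of this kind can be computed by counting how many listings mention each amenity. The sketch below assumes each listing dictionary stores its amenities as a list under an `amenities` key. That key, the helper name `amenity_prevalence` and the sample data are assumptions for illustration.

```python
from collections import Counter


def amenity_prevalence(listings, top_n=10):
    """Return (amenity, % of listings mentioning it) for the top_n amenities."""
    counts = Counter()
    for listing in listings:
        # set() so that an amenity repeated within one listing is counted once.
        counts.update(set(listing.get("amenities", [])))
    total = len(listings)
    return [(name, round(100 * count / total, 1))
            for name, count in counts.most_common(top_n)]


# Hypothetical listings, for illustration only.
sample_listings = [
    {"amenities": ["Swimming pool", "Gym"]},
    {"amenities": ["Swimming pool"]},
    {"amenities": ["Swimming pool", "Gym", "BBQ pit"]},
    {"amenities": []},
]
for name, pct in amenity_prevalence(sample_listings):
    print(f"{name}: {pct}%")
```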
In this article, I covered the use of web scraping to acquire data for analysis, followed by exploratory data analysis in Python to extract interesting insights relating to the Singapore condominium rental market.
You can view the source code and complete EDA for this project in the GitHub repo here. Do also check out the other interesting data projects over at my Medium page, and I look forward to connecting with you on LinkedIn. Cheers!
Note: The original Medium article can be found here
This article is for educational purposes only. Information accessed is based on publicly available content on 99.co. Contents extracted have not been (and will not be) reproduced, republished, uploaded, posted, transmitted or otherwise distributed in any way. Refer to the Terms of Service for details.