Assume that you are an entrepreneur and want to launch a new business but do not know where. It is a critical decision because the characteristics of the chosen area directly affect the survival chance of your business at the beginning and its profitability later on.

I will tell you how you can utilize data science techniques to find an answer to this problem by walking through one of my projects.

Neighborhoods with the Highest Chance of Success

Business Problem:

An entrepreneur would like to launch a gym business in central Toronto, Canada, and s/he would like to know which neighborhood would be the best place for such an investment.

Ideas for possible solutions:

The consensus approach would be to estimate the customer potential (target audience), competition level, and income level of each neighborhood and choose the best one. Another big factor is the security of the neighborhood. Nobody wants to take risks going to a gym.

The criteria above rely on the simplest business realities. Let me explain:

Customer Potential: No business is sustainable without a demand for its products or services. Therefore, every business should have an idea about the demand in the market. Customer potential is the primary metric measuring the potential demand.

In this case, the demographics of gym subscribers matter. Statistics show that the majority of gym subscribers are between 15 and 45 years old.

Competition Level: This is a very straightforward factor: the sales of a business are negatively affected by any type of competitor. If competition is high in a market, it is generally better to avoid launching a product or business in that market. Substitutes, which may dry up demand in the market, should also be taken into consideration.

For a gym entrepreneur, the substitutes would be green spaces, parks, and running trails that people can use to do sports instead of going to a gym.

Income Level: Potential demand is not everything if it cannot be realized. Prospects may desire a purchase but be unable to afford it, so it is important to check whether people can afford the products or services of the prospective business.

Statistics show that gym subscribers are generally people who earn $70,000 or more. Knowing the percentage of people in each neighborhood who fit this criterion would be very helpful.
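As a rough illustration of how the age and income criteria could be turned into numbers, the sketch below computes the share of a neighborhood's residents that falls inside a target range. The function name `share_in_range`, the bracket layout, and all counts are assumptions made up for this example; they are not the project's actual data or code.

```python
def share_in_range(bracket_counts, low, high):
    """Return the fraction of people whose bracket lies fully inside [low, high]."""
    total = sum(bracket_counts.values())
    if total == 0:
        return 0.0
    inside = sum(
        count
        for (start, end), count in bracket_counts.items()
        if start >= low and end <= high
    )
    return inside / total


# Hypothetical counts for one neighborhood (not real Toronto data).
age_brackets = {(0, 14): 1500, (15, 29): 2500, (30, 45): 3000, (46, 120): 3000}
income_brackets = {(0, 69_999): 6000, (70_000, 10**9): 4000}

age_share = share_in_range(age_brackets, 15, 45)               # target age group
income_share = share_in_range(income_brackets, 70_000, 10**9)  # earns $70,000 or more

print(f"target age share: {age_share:.2f}, target income share: {income_share:.2f}")
```

Working with shares rather than raw counts keeps large and small neighborhoods comparable, although the raw count still matters when estimating the absolute size of the market.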

Security Level: This factor can easily be measured with crime rate data. A neighborhood with a high crime rate is not a logical place to launch a business. Unsafe areas are a direct disadvantage in terms of demand. In addition, the chance of unpleasant incidents could increase costs and reduce profits.

Now that I have highlighted the criteria for a new business, I would like to show how we can use them.

The Plan:

Gather data about the required fields
Clean the data and optimize
Use the data in a machine learning clustering model
Visualize the findings

Gathering data:

Toronto municipality offers an amazing database. This database gives access to a ton of demographic, economic, and sociological records for each neighborhood.

I collected data about the crime rate, income level, green park size, and the number of potential customers using the API of this database.

In addition to these, I collected spatial data for the neighborhoods to present the findings on a map.

This is the link for the Database of Toronto Municipality

The Foursquare application also provides a free API service that helped me detect the locations of gyms in each area. I used this to get an idea of the competition level.
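To make the competition measure concrete, here is a minimal sketch that counts gyms per neighborhood and expresses the result per 10,000 residents. The record layout (`neighborhood` and `category` keys) is a simplified, hypothetical shape chosen for this example, not the actual Foursquare response format, and the per-10,000 normalization is my assumption rather than something taken from the project.

```python
from collections import Counter


def gyms_per_10k(venues, population):
    """Count venues whose category mentions 'gym' and scale by population."""
    counts = Counter(
        v["neighborhood"] for v in venues if "gym" in v["category"].lower()
    )
    return {
        name: 10_000 * counts.get(name, 0) / residents
        for name, residents in population.items()
        if residents > 0
    }


# Made-up venue records and populations, for illustration only.
venues = [
    {"neighborhood": "A", "category": "Gym / Fitness Center"},
    {"neighborhood": "A", "category": "Gym"},
    {"neighborhood": "A", "category": "Coffee Shop"},
    {"neighborhood": "B", "category": "Yoga Studio"},
]
population = {"A": 20_000, "B": 10_000}

competition = gyms_per_10k(venues, population)
print(competition)
```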

Link for Foursquare API

I gathered the data from these APIs in the desired formats by using Python. This is the link for the details of the data collection part of the project.

Data cleaning and optimization:

The data was obtained smoothly from the APIs. However, since it came from different sources, I had to convert it into matching formats. This process took most of my time, but there are not many exciting things to show.
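One typical chore in this step is that the same neighborhood is spelled differently in each source. The sketch below shows one possible way to normalize names and join several tables on them. The helper names, the naming quirks it handles (a trailing numeric id, hyphens, letter case), and the sample values are all assumptions for illustration; the real datasets may need different rules.

```python
import re


def normalize_name(name):
    """Lowercase, drop a trailing '(123)' id, and collapse punctuation and spaces."""
    name = re.sub(r"\(\d+\)\s*$", "", name)
    name = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return name.strip()


def merge_sources(*sources):
    """Inner-join several {name: {field: value}} tables on the normalized name."""
    keyed = [{normalize_name(k): v for k, v in s.items()} for s in sources]
    common = set(keyed[0]).intersection(*keyed[1:])
    merged = {}
    for key in sorted(common):
        row = {}
        for table in keyed:
            row.update(table[key])
        merged[key] = row
    return merged


# Made-up values; only the shape of the problem is meant to be realistic.
crime = {"Annex (95)": {"crime_rate": 120.0}, "Bay Street Corridor (76)": {"crime_rate": 300.0}}
income = {"annex": {"median_income": 90_000}, "Bay Street-Corridor": {"median_income": 80_000}}

merged = merge_sources(crime, income)
print(merged)
```

An inner join silently drops neighborhoods that appear in only one source, so it is worth comparing the row counts before and after merging.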

Machine Learning Modeling:

After getting the data, we need an unsupervised machine learning model to make inferences from it. The model is K-Means clustering, one of the most popular clustering models in the machine learning field. As its name suggests, it clusters the observations (neighborhoods) according to their characteristics. A technique called the elbow curve hints at the optimal number of clusters: the within-cluster sum of squared distances is plotted against the number of clusters, and the point where the curve bends is a good candidate.

The elbow curve hints at 4 clusters.
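For readers who want to see the mechanics, below is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) together with the numbers behind an elbow curve. It is not the project's code: the project details are behind the links, and a library implementation would normally be used in practice. The min-max scaling step, the function names, and the synthetic points are my own assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single feature dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def k_means(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, inertia


def elbow_curve(points, max_k, seed=0):
    """Inertia (within-cluster sum of squares) for k = 1 .. max_k."""
    return {k: k_means(points, k, seed=seed)[2] for k in range(1, max_k + 1)}


# Synthetic, made-up points (not the Toronto data): four tight groups.
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
raw = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
points = min_max_scale(raw)

curve = elbow_curve(points, 6)
for k, inertia in curve.items():
    print(k, round(inertia, 4))
```

The result depends on the random starting centroids, so it is common to run the algorithm with several seeds and keep the run with the lowest inertia.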

After setting the number of clusters, I ran the model. This is the visualization of the output:

Clusters 0 and 3 overlap, but they are not the ones we are interested in. According to the model, the best neighborhoods for investment are the ones in cluster 2. These neighborhoods show the highest income level, a comparatively low level of competition, and an acceptable crime rate.
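One way to back up this kind of reading is to average each feature per cluster and compare the profiles. The sketch below does that on made-up, already scaled values. The scoring rule (income minus competition minus crime, all equally weighted) is purely my assumption for illustration; in the project the best cluster was chosen by inspecting the output.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(rows, labels):
    """Average every feature within each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {field: mean(r[field] for r in members) for field in members[0]}
        for label, members in grouped.items()
    }


def best_cluster(profiles):
    """Pick the cluster with high income and low competition and crime."""
    def score(profile):
        return profile["income"] - profile["competition"] - profile["crime"]
    return max(profiles, key=lambda label: score(profiles[label]))


# Hypothetical scaled features (0 = lowest, 1 = highest) and cluster labels.
rows = [
    {"income": 0.2, "competition": 0.8, "crime": 0.7},
    {"income": 0.3, "competition": 0.6, "crime": 0.9},
    {"income": 0.9, "competition": 0.2, "crime": 0.3},
    {"income": 0.8, "competition": 0.3, "crime": 0.2},
    {"income": 0.5, "competition": 0.5, "crime": 0.5},
]
labels = [0, 0, 2, 2, 3]

profiles = cluster_profiles(rows, labels)
best = best_cluster(profiles)
print(best, profiles[best])
```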

Here is the link for the details of the model, data, and neighborhoods.

Visualization:

The best neighborhoods are detected and listed. It is also important to show them on a map. I needed to merge the geospatial data and clustering results to do it. The image shown below is an example from the representation created in Tableau. Here, there is an interactive dashboard that gives detailed information about the neighborhoods which may help the decision-makers.
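The merge of geospatial data and clustering results can be sketched as follows, assuming the neighborhood boundaries are available as a GeoJSON FeatureCollection. The property key `name`, the cluster values, and the placeholder point geometries are hypothetical; the real boundary file would contain polygons and may use a different key.

```python
import json


def attach_clusters(geojson, clusters, name_key="name"):
    """Return a copy of the FeatureCollection with a 'cluster' property added."""
    features = []
    for feature in geojson["features"]:
        props = dict(feature["properties"])  # copy so the input is not mutated
        props["cluster"] = clusters.get(props.get(name_key))  # None if unmatched
        features.append({**feature, "properties": props})
    return {**geojson, "features": features}


# Placeholder features with made-up coordinates, for illustration only.
neighborhoods = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Annex"},
         "geometry": {"type": "Point", "coordinates": [-79.40, 43.67]}},
        {"type": "Feature", "properties": {"name": "Riverdale"},
         "geometry": {"type": "Point", "coordinates": [-79.35, 43.67]}},
    ],
}
clusters = {"Annex": 2}  # hypothetical clustering result

enriched = attach_clusters(neighborhoods, clusters)
serialized = json.dumps(enriched, indent=2)  # text that could be saved to a .geojson file
print(serialized)
```

Neighborhoods without a match get a `cluster` value of `None`, which makes naming mismatches easy to spot on the map.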

References:

FinancesOnline: I used the insights provided on this website to make decisions regarding data collection. I was able to create a list of the required data using only the information gathered from this website. Its statistical data gave me a good understanding of the characteristics of gym subscribers.
Medium Post: My work originated from this Medium post. I was fascinated by the idea and wanted to do a similar project. Thanks to the author of this post, I learned a lot through this project.