NOTE: This was written on my personal GitHub website on March 17, 2019. I'm republishing here, as this is still very useful for personal data science projects.

Introduction

Not all machine learning problems require a prediction. Sometimes, machine learning can be used to discover hidden patterns or data groupings without the need for human intervention. This process is called unsupervised learning. Unsupervised learning analyzes and clusters unlabeled datasets, which makes it easier to create supervised learning models to predict such cluster labels.

This tutorial will go over how to use K-means to create clusters from a dataset.

Problem

The VI-SPDAT (Vulnerability Index - Service Prioritization Decision Assistance Tool) is a survey administered to both individuals and families to determine risk and prioritization when providing assistance to people who are homeless or at risk of homelessness. The goal of Miriam's Kitchen (MK) is to lower the SPDAT scores of the homeless people who enter its shelter. MK provides us with each person's SPDAT score from before entering MK (Beginning Score) and their most recent SPDAT score (taken during their stay or before they left the shelter; Ending Score). With this data we can visualize how effective MK is at providing assistance to the homeless.

Data Collection / Munging

The Excel file is located in this GitHub repo.

This tutorial shares the same repo as one of my prior tutorials. If you downloaded the SPDAT Excel file locally, you'll have to change the variables mk_folder and mk_detailed_spdat_excel_file.

Below is the data frame.

50140 entered with a beginning SPDAT of 16.0, and its most recent SPDAT was 9.0. Yet the net SPDAT was -27.0, which isn't 9 - 16 = -7. The reason is that between the beginning and most recent SPDAT, 50140's SPDAT was evaluated 4 more times. The second most recent SPDAT was 36.0, so the net is 9 - 36 = -27.0.

Looking at detailed notes in the excel file, 50140 had some difficulty adjusting due to external factors (job loss, family matters). But if those external factors weren't included, how would 50140's SPDAT be affected?

We decided to estimate the ending score without external factors by taking the beginning score and adding the SPDAT score change. This approach is simplistic and also flawed, because the score change is measured against the previous SPDAT rather than across the whole stay.

We also drop the score change, as it is no longer needed.
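The step above can be sketched in pandas roughly as follows. The column names and the second and third rows are hypothetical stand-ins, not the real schema of the Excel file; only the first row uses the numbers quoted for 50140.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the real schema.
# Only the first row (50140) uses values quoted in the text.
df = pd.DataFrame({
    "client_id": [50140, 50141, 50142],
    "beginning_score": [16.0, 22.0, 8.0],
    "ending_score": [9.0, 10.0, 12.0],
    "score_change": [-27.0, -12.0, 4.0],
})

# Estimate the ending score as the beginning score plus the recorded change.
df["ending_score_predicted"] = df["beginning_score"] + df["score_change"]

# The score change is no longer needed once the estimate exists.
df = df.drop(columns=["score_change"])

print(df)
```

For 50140 this gives 16 + (-27) = -11, a negative score, which illustrates why the estimate is flawed.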

Below are the results.

Data Visualizations

To test whether the predicted ending score matches the actual ending score, we'll plot the KDEs (kernel density estimates) of both and compare how similar the distributions are.

Below is the KDE plot.
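If you want a number to go with the visual comparison, one option is to evaluate both KDEs on a shared grid and measure how much they overlap. This is a minimal sketch using scipy's gaussian_kde on synthetic data. The arrays, the grid range, and the overlap measure are my assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two score columns (not the real MK data).
ending = rng.normal(loc=12.0, scale=4.0, size=200)
ending_predicted = ending + rng.normal(loc=0.0, scale=1.5, size=200)

# Evaluate both kernel density estimates on the same grid of score values.
grid = np.linspace(-5.0, 30.0, 351)
density_actual = gaussian_kde(ending)(grid)
density_predicted = gaussian_kde(ending_predicted)(grid)

# Overlap coefficient: area under the pointwise minimum of the two curves.
# Values near 1 mean the distributions are nearly identical.
step = grid[1] - grid[0]
overlap = float(np.minimum(density_actual, density_predicted).sum() * step)

print(f"overlap coefficient: {overlap:.2f}")
```

The two density arrays can also be passed straight to a line plot to reproduce a KDE chart.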

The predicted score ends up being similar to the actual score. To see if they have a similar relationship, let's plot the scatter plots of the two.

There is some linear correlation between the beginning score and the predicted ending score. Apart from some outliers, MK seems to do a good job of lowering SPDAT (minus external factors) when the beginning SPDAT was high, and an OK job of maintaining SPDAT when the beginning SPDAT was low. Let's see if this holds true when we add the external factors back in.

Below are the results.

So there's a lot of variability here. A linear relationship does not describe this data well, so we'll have to turn to cluster analysis.
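One way to back up a visual judgment like this is to fit a straight line and look at R squared. The sketch below uses numpy on synthetic data; the helper name linear_r_squared and both datasets are hypothetical and only show what a weak versus a strong linear fit looks like.

```python
import numpy as np

def linear_r_squared(x, y):
    """Fit y = slope * x + intercept by least squares and return R squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
beginning = rng.uniform(0.0, 40.0, size=200)

# Synthetic case 1: ending score follows the beginning score closely.
tight_ending = 0.5 * beginning + rng.normal(0.0, 1.0, size=200)

# Synthetic case 2: ending score is unrelated to the beginning score.
scattered_ending = rng.uniform(0.0, 40.0, size=200)

print(f"tight fit R^2:     {linear_r_squared(beginning, tight_ending):.2f}")
print(f"scattered fit R^2: {linear_r_squared(beginning, scattered_ending):.2f}")
```

An R squared close to 0 on the real beginning and ending scores would support the reading that a straight line is the wrong model.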

K-Means Clustering

Clustering is an unsupervised machine learning technique. There are no defined dependent and independent variables. The goal is to identify and group similar observations. According to the detailed SPDAT files, changes in SPDAT scores are a result of mental health issues. The goal is to separate the data into clusters that can help guide MK's mental health strategies for the homeless.

The Elbow Method

Below is the plot.

The elbow of the curve is at 4, so the optimal number of clusters is 4.
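For readers who have not used the elbow method, here is a minimal sketch with scikit-learn. It fits K-means for k = 1 to 10 and records the inertia (within-cluster sum of squares) for each k. The data is synthetic, with four blobs chosen by me so the elbow is easy to see; it is not the MK dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic (beginning score, ending score) pairs around four assumed centers.
centers = np.array([[8.0, 8.0], [30.0, 8.0], [8.0, 30.0], [30.0, 30.0]])
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in centers])

# Fit K-means for each candidate k and record the inertia.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in enumerate(inertias, start=1):
    print(f"k={k:2d}  inertia={inertia:10.1f}")
```

Plotting inertia against k gives the elbow curve. The elbow is the value of k after which adding more clusters stops reducing inertia by much.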

Clustering

Below are the results from the scatter plot.

So there are 4 segments to focus on.

Homeless with low beginning SPDAT scores and low ending SPDAT scores (blue)
Homeless with high beginning SPDAT scores and low ending SPDAT scores (green)
Homeless with low beginning SPDAT scores and high ending SPDAT scores (orange)
Homeless with high beginning SPDAT scores and high ending SPDAT scores (red)

The red segment contains people with serious mental health issues, and the blue segment contains people with little or no mental health issues. People in these two segments seem to be doing the same as before, so it would be wise for MK to limit its efforts on them. The orange segment is concerning, as there are more points in that segment than in the green segment. There are a good number of green points where MK helped drop a homeless person's SPDAT by 15-20 points. Analyzing how MK did that could help reduce the number of people in the orange segment.
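A segmentation like the one above can be sketched with scikit-learn's KMeans. The data below is synthetic, and the threshold of 19 used to call a centroid "low" or "high" is an arbitrary assumption of mine for this toy data, not a SPDAT cutoff.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Synthetic data: column 0 is the beginning score, column 1 the ending score.
centers = np.array([[8.0, 8.0], [30.0, 8.0], [8.0, 30.0], [30.0, 30.0]])
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in centers])

# Fit K-means with 4 clusters and get a cluster label for every row.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

def describe(centroid, threshold=19.0):
    """Name a cluster by whether its centroid is above the assumed threshold."""
    start = "high" if centroid[0] > threshold else "low"
    end = "high" if centroid[1] > threshold else "low"
    return f"{start} beginning / {end} ending"

segments = {i: describe(c) for i, c in enumerate(kmeans.cluster_centers_)}

for i, name in segments.items():
    print(f"cluster {i}: {name} ({int(np.sum(labels == i))} points)")
```

Naming clusters from their centroids avoids relying on the cluster numbers themselves, which K-means assigns arbitrarily.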

Conclusions

Miriam's Kitchen tracks SPDAT scores when a patient enters the homeless shelter and when the patient leaves it. We tried to determine whether beginning and ending SPDAT scores had a linear relationship. We were unable to find one, so we clustered the data points into groups instead.

K-means clustering showed some promising insights. The green cluster showed that there are homeless people who benefited from MK. The orange cluster showed that there are homeless people whose scores got worse while at MK.

MK could analyze the strategies it used on the green cluster and advocate for more funding based on the benefits the kitchen provides. On the flip side, MK could analyze why its strategies failed with the homeless people in the orange cluster. It can either tweak its strategies or refer these people to a different shelter.

GitHub Code

SPDAT

Thanks for reading! I'm also active on Medium and LinkedIn. Click here to view my tech tutorials and career advice on Towards Data Science. Click here if you want to get in touch with me on LinkedIn.