Image uploaded by Simon on Unsplash

NOTE: This was originally published on my personal GitHub website on Feb 19, 2019. I'm republishing it here because it is still useful for personal data science projects.

Introduction

Not all business problems need machine learning. Some questions can be answered through exploratory data analysis or statistical analysis. Understanding the problem is crucial to determining whether machine learning is needed. This post walks through a simple tutorial on how to leverage statistics to answer a customer question.

Problem

The DC government has questioned whether Miriam’s Kitchen (MK) is worth the cost. The government would rather spend 20,000 on private housing than 40,000 on MK initiatives. Using Point in Time (PIT) data (how many homeless people were on the streets or sheltered at a given point in time) for Federal CoC (Continuum of Care), Federal State, and MK, we can compare how MK performs against other homeless efforts in DC.

We’ll be using Federal State, Federal CoC, and MK data. For each, we’ll take the PIT counts for both sheltered homeless and overall homeless for the years 2007-2018. Since there are vastly more homeless people in the DC State and DC CoC programs than in MK, raw counts are not comparable. Instead, we’ll compare the ratio of sheltered homeless PIT to overall homeless PIT for each of the three sources.
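To make the ratio idea concrete, here is a minimal sketch. The function name and the counts are hypothetical and chosen only for illustration; they are not actual PIT figures.

```python
def sheltered_ratio(sheltered, overall):
    """Return the share of the overall homeless PIT count that was sheltered."""
    if overall <= 0:
        raise ValueError("overall PIT count must be positive")
    return sheltered / overall


# Hypothetical counts, for illustration only (not actual PIT figures).
large_program = sheltered_ratio(sheltered=6000, overall=8000)
small_program = sheltered_ratio(sheltered=30, overall=40)

# Programs of very different sizes become directly comparable as ratios.
print(large_program, small_program)
```

Because the ratio normalizes by program size, a small program and a large one can be put on the same scale.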

Load, Clean, and Visualize Data

As part of exploratory data analysis, we'll need to clean the data to make it readable. Below is the Python logic that combines the MK, Federal CoC, and Federal State data into one pandas data frame.

The Federal CoC and Federal State data sets each have only one entry for DC, so we hardcoded the row number of that entry for each data set in coc_key and state_key.
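The original code is not reproduced here, so the following is only a rough sketch of how such a combination step might look. The table layout (one row per area, one column per year), the row positions, and every number are placeholder assumptions; only the names state_key, coc_key, Federal_State, Federal_CoC, and MK_PIT come from the post.

```python
import pandas as pd

YEARS = [2007, 2008, 2009]  # shortened; the post covers 2007-2018

# Placeholder federal tables: one row per area, one column per year.
# All counts are made up for illustration.
state_sheltered = pd.DataFrame([[900, 950, 980], [4000, 4200, 4100]], columns=YEARS)
state_overall = pd.DataFrame([[1200, 1250, 1300], [5000, 5600, 5125]], columns=YEARS)
coc_sheltered = pd.DataFrame([[4100, 4300, 4200], [700, 720, 690]], columns=YEARS)
coc_overall = pd.DataFrame([[5125, 5375, 5600], [1000, 900, 920]], columns=YEARS)

# Hardcoded row positions of the single DC entry in each federal table
# (assumed positions for this placeholder data).
state_key = 1
coc_key = 0

# Placeholder MK counts per year.
mk_sheltered = pd.Series([45, 48, 54], index=YEARS)
mk_overall = pd.Series([50, 60, 60], index=YEARS)

# One column per source, one row per year, each cell a sheltered/overall ratio.
pit = pd.DataFrame({
    "Federal_State": state_sheltered.iloc[state_key] / state_overall.iloc[state_key],
    "Federal_CoC": coc_sheltered.iloc[coc_key] / coc_overall.iloc[coc_key],
    "MK_PIT": mk_sheltered / mk_overall,
})
pit.index.name = "Year"
print(pit)
```

Hardcoding the row positions is brittle if the source files change, so selecting the DC row by a label column would be a safer choice where one is available.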

The data frame is shown below.

Below are descriptive statistics of the data.
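A summary table like this can be produced with pandas' describe(). The sketch below uses placeholder ratios (not the real PIT values) just to show the call.

```python
import pandas as pd

# Placeholder sheltered/overall ratios, for illustration only.
pit = pd.DataFrame(
    {
        "Federal_State": [0.70, 0.72, 0.74, 0.76],
        "Federal_CoC": [0.71, 0.73, 0.75, 0.77],
        "MK_PIT": [0.88, 0.90, 0.92, 0.94],
    },
    index=[2007, 2008, 2009, 2010],
)

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = pit.describe()
print(summary.loc[["mean", "min", "max"]])
```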

We can visualize the data as a violin plot.
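As a rough sketch, a violin plot of the three ratio columns could be drawn with matplotlib. The plotting library is my assumption here (the original may have used a different one), and the data is placeholder only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder sheltered/overall ratios, for illustration only.
pit = pd.DataFrame({
    "Federal_State": [0.70, 0.72, 0.74, 0.76],
    "Federal_CoC": [0.71, 0.73, 0.75, 0.77],
    "MK_PIT": [0.88, 0.90, 0.92, 0.94],
})

fig, ax = plt.subplots()
# One violin per source; showmeans adds a marker at each column's mean.
parts = ax.violinplot([pit[col].to_numpy() for col in pit.columns], showmeans=True)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(list(pit.columns))
ax.set_ylabel("Sheltered / overall PIT ratio")
```

In an interactive session, drop the Agg line and call plt.show() to display the figure.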

The descriptive statistics table shows that MK PIT has a higher mean, min, and max than the other two sources. The violin plot likewise suggests that MK is having a bigger impact on sheltering the homeless.

Hypothesis Testing

Still, this isn’t enough to conclude that MK is impacting DC’s homelessness efforts. MK_PIT draws on a much smaller population than Federal_CoC and Federal_State. We want to use hypothesis testing to see whether MK_PIT is making an impact on Federal_CoC and Federal_State. To choose the hypothesis test, we’ll plot kernel density estimates (KDEs) to inspect each distribution.
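One way to draw the KDE curves is with scipy's gaussian_kde and matplotlib, sketched below. The library choice, the grid range, and the data are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Placeholder sheltered/overall ratios, for illustration only.
pit = pd.DataFrame({
    "Federal_State": [0.70, 0.72, 0.74, 0.76, 0.71, 0.75],
    "Federal_CoC": [0.71, 0.73, 0.75, 0.77, 0.72, 0.76],
    "MK_PIT": [0.88, 0.90, 0.92, 0.94, 0.89, 0.93],
})

grid = np.linspace(0.5, 1.1, 200)  # assumed plotting range for the ratios
densities = {}
fig, ax = plt.subplots()
for col in pit.columns:
    # Fit a Gaussian KDE to the column and evaluate it on the grid.
    densities[col] = gaussian_kde(pit[col].to_numpy())(grid)
    ax.plot(grid, densities[col], label=col)
ax.set_xlabel("Sheltered / overall PIT ratio")
ax.set_ylabel("Density")
ax.legend()
```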

The distributions show wide tails. We also have 12 elements per PIT data set ( < 30 ), three different samples, and quantitative interval data. Based on this, we’ll conduct two 2-sample t-tests: one with Federal State PIT and MK PIT, another with Federal CoC PIT and MK PIT. Both tests use a 95% confidence level, that is, a 0.05 significance level (the probability of making a Type I error).
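A 2-sample t-test of this kind can be run with scipy.stats.ttest_ind. The sketch below is an assumption about the setup rather than the original code: the helper name and the sample values are hypothetical, and it uses scipy's default equal-variance (Student's) test. Passing equal_var=False would give Welch's test instead.

```python
from scipy import stats

ALPHA = 0.05  # significance level stated in the post


def compare_to_mk(other, mk, alpha=ALPHA):
    """Run a 2-sample t-test and report whether the null hypothesis is rejected."""
    t_stat, p_value = stats.ttest_ind(other, mk)  # equal_var=True by default
    return t_stat, p_value, bool(p_value < alpha)


# Placeholder ratio samples, for illustration only (not the real PIT ratios).
federal_state = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70]
mk_pit = [0.90, 0.92, 0.88, 0.91, 0.93, 0.89]

t_stat, p_value, reject = compare_to_mk(federal_state, mk_pit)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject null: {reject}")
```

The same helper would be called twice, once with the Federal State ratios and once with the Federal CoC ratios, each against the MK ratios.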

T-Test 1: Federal State and MK

Our null hypothesis for the Federal State-MK test is that MK does not impact the homeless efforts of Federal State. The results of the t-test are shown below.

Since the p-value is less than our significance level, we reject our null hypothesis. MK has a statistically significant impact on Federal State PIT data.

T-Test 2: Federal CoC and MK

Our null hypothesis for the Federal CoC-MK test is that MK does not impact the homeless efforts of Federal CoC. The results of the t-test are shown below.

Since the p-value is less than our significance level, we reject our null hypothesis. MK has a statistically significant impact on Federal CoC PIT data.

Conclusions

With visualizations and hypothesis testing, we have shown that MK has a significant effect on sheltering the homeless in DC. It contributes significantly to Federal CoC and Federal State homeless efforts.

GitHub Code

PIT Analysis
