Statistical Analysis On Organization Impact
A tutorial on how to solve a business problem without machine learning, using only exploratory data analysis and statistical analysis.
NOTE: This was written on my personal GitHub website on Feb 19, 2019. I'm republishing here, as this is still very useful for personal data science projects.
Not all business problems need machine learning. Some questions can be answered through exploratory data analysis or statistical analysis. Understanding the problem is crucial to determine whether machine learning is needed. This post will go over a simple tutorial on how to leverage statistics to answer a customer question.
The DC government has questioned whether Miriam’s Kitchen (MK) is worth the cost: it would rather spend 20,000 on private housing than 40,000 on MK initiatives. Using Point in Time (PIT) data (how many homeless people were on the streets or sheltered at a given point in time) for the Federal CoC (Continuum of Care), Federal State, and MK, we can compare how MK performs relative to other homeless efforts in DC.
We’ll be using Federal State, Federal CoC, and MK data. For each, we’ll take the PIT counts for both sheltered homeless and overall homeless for the years 2007-2018. Since the DC State and DC CoC programs serve vastly more homeless people than MK does, raw counts are not comparable. Instead, we’ll compare the ratio of sheltered homeless PIT to overall homeless PIT for each of the three categories.
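To make the ratio idea concrete, here is a minimal sketch in plain Python. The function name `sheltered_ratio` and all the counts are hypothetical placeholders, not figures from the actual PIT data; the point is only that dividing sheltered by overall puts a large program and a small one on the same scale.

```python
def sheltered_ratio(sheltered, overall):
    """Return sheltered / overall for each year, pairing the two lists by position."""
    if len(sheltered) != len(overall):
        raise ValueError("sheltered and overall must cover the same years")
    return [s / o for s, o in zip(sheltered, overall)]


# Placeholder counts, NOT real PIT figures: one large program, one small one.
large_program = sheltered_ratio([6000, 6500], [7000, 7500])
small_program = sheltered_ratio([45, 50], [50, 60])

# Both results are fractions between 0 and 1, so they can be compared directly.
print(large_program)
print(small_program)
```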
Load, Clean, and Visualize Data
As part of exploratory data analysis, we first need to clean the data to make it readable. Below is the Python logic that combines MK, Federal CoC, and Federal State data into one pandas data frame.
The Federal CoC and Federal State data each have only one entry for DC, so we hardcoded the row number of the Federal DC CoC entry and the Federal DC State entry in coc_key and state_key.
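A rough sketch of how that combination step might look is given here. The table layout (one row per geography, one column per year), the helper `dc_ratio`, the key values, and every number are assumptions made for illustration; only the names `state_key`, `coc_key`, and the three column labels come from this post.

```python
import pandas as pd

YEARS = [2007, 2008, 2009]  # shortened for the sketch; the post covers 2007-2018


def dc_ratio(sheltered, overall, row_key):
    """Sheltered / overall ratio per year for the row at position row_key."""
    return sheltered.iloc[row_key] / overall.iloc[row_key]


# Placeholder tables (NOT real PIT data): one row per geography, one column per year.
state_sheltered = pd.DataFrame([[400, 410, 420], [800, 820, 840]], columns=YEARS)
state_overall = pd.DataFrame([[500, 500, 500], [1000, 1000, 1000]], columns=YEARS)
coc_sheltered = pd.DataFrame([[700, 720, 750]], columns=YEARS)
coc_overall = pd.DataFrame([[1000, 1000, 1000]], columns=YEARS)
mk_ratio = pd.Series([0.90, 0.92, 0.95], index=YEARS)

state_key = 1  # hardcoded position of the DC row (hypothetical value)
coc_key = 0    # hardcoded position of the DC CoC row (hypothetical value)

# One column per source, one row per year.
pit = pd.DataFrame({
    "Federal_State": dc_ratio(state_sheltered, state_overall, state_key),
    "Federal_CoC": dc_ratio(coc_sheltered, coc_overall, coc_key),
    "MK_PIT": mk_ratio,
})
pit.index.name = "Year"
print(pit)
```

Hardcoding the row positions is brittle if the source files change, so looking the row up by a state or CoC identifier would be a safer alternative when such a column is available.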
The data frame is shown below.
Below are descriptive statistics of the data.
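For readers following along, descriptive statistics like these can be produced with pandas' `DataFrame.describe`. The sketch below uses a small placeholder frame, not the real PIT ratios, and reuses the column names mentioned in this post.

```python
import pandas as pd

# Placeholder ratios, NOT the real PIT data.
pit = pd.DataFrame({
    "Federal_State": [0.80, 0.82, 0.84, 0.86],
    "Federal_CoC": [0.70, 0.72, 0.75, 0.77],
    "MK_PIT": [0.90, 0.92, 0.95, 0.97],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = pit.describe()
print(summary.loc[["mean", "min", "max"]])
```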
We can visualize the data as a violin plot.
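One way to draw such a plot is matplotlib's `Axes.violinplot`, sketched here. The original plotting code is not shown in this post, so the library choice, the output file name, and the placeholder ratios are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder ratios, NOT the real PIT data.
ratios = {
    "Federal_State": [0.80, 0.82, 0.84, 0.86],
    "Federal_CoC": [0.70, 0.72, 0.75, 0.77],
    "MK_PIT": [0.90, 0.92, 0.95, 0.97],
}

fig, ax = plt.subplots(figsize=(6, 4))
# One violin per series; violins are placed at x = 1, 2, 3 by default.
parts = ax.violinplot(list(ratios.values()), showmeans=True)
ax.set_xticks(list(range(1, len(ratios) + 1)))
ax.set_xticklabels(list(ratios.keys()))
ax.set_ylabel("Sheltered / overall PIT ratio")
fig.savefig("pit_violin.png")  # hypothetical output file name
```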
In the descriptive statistics table, MK PIT has a higher mean, min, and max than the other two series. The violin plot likewise shows that MK is having a bigger impact on sheltering the homeless.
Still, this isn’t enough to conclude that MK is impacting DC’s homelessness efforts, because MK_PIT is based on a much smaller population than Federal_CoC and Federal_State. We want to use hypothesis testing to see whether MK_PIT differs significantly from Federal_CoC and Federal_State. To choose the hypothesis test, we’ll plot the kernel density estimates (KDEs) to check the shape of each distribution.
The distributions show wide tails. We also have 12 observations per PIT series (fewer than 30), three different samples, and quantitative interval data. Based on this, we’ll conduct two 2-sample t-tests: one comparing Federal State PIT with MK PIT, and another comparing Federal CoC PIT with MK PIT. Both use a 95% confidence level, that is, a significance level of 0.05 (the probability of making a Type I error).
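As an illustration of what a kernel density estimate is, the sketch below fits one with `scipy.stats.gaussian_kde` on placeholder ratios (not the real MK data) and evaluates it on a grid. The grid padding and the default bandwidth are assumptions; the original plots may have been made with a different tool or settings.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder ratios, NOT the real MK PIT data.
mk_ratio = np.array([0.88, 0.90, 0.91, 0.93, 0.95, 0.97])

kde = gaussian_kde(mk_ratio)  # Gaussian kernels; default bandwidth is Scott's rule
grid = np.linspace(mk_ratio.min() - 0.1, mk_ratio.max() + 0.1, 400)
density = kde(grid)

# Trapezoid rule: a density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(f"area under the estimated density: {area:.3f}")
```

Plotting `grid` against `density` for each of the three series gives the curves used to judge the shape and tails of the distributions.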
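The two tests described above could be run with `scipy.stats.ttest_ind`, as in this sketch. The helper `compare_to_mk` and the sample values are hypothetical, and the post does not say which t-test variant was used, so SciPy's default pooled-variance test is an assumption; passing `equal_var=False` would give Welch's t-test instead.

```python
from scipy.stats import ttest_ind

ALPHA = 0.05  # significance level from the post


def compare_to_mk(other, mk, alpha=ALPHA):
    """Two-sided two-sample t-test; returns (t statistic, p-value, reject null?)."""
    result = ttest_ind(mk, other)  # pooled variance by default (equal_var=True)
    return float(result.statistic), float(result.pvalue), bool(result.pvalue < alpha)


# Placeholder ratios, NOT the real PIT data.
mk = [0.90, 0.92, 0.95, 0.97]
federal_state = [0.80, 0.82, 0.84, 0.86]
federal_coc = [0.70, 0.72, 0.75, 0.77]

for name, sample in [("Federal State", federal_state), ("Federal CoC", federal_coc)]:
    t_stat, p_value, reject = compare_to_mk(sample, mk)
    print(f"{name} vs MK: t={t_stat:.3f}, p={p_value:.4f}, reject null: {reject}")
```

Rejecting the null here only says the two means differ by more than chance would explain at the chosen significance level; it does not by itself establish why they differ.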
T-Test 1: Federal State and MK
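Our null hypothesis for the Federal State-MK test is that MK does not impact the homeless efforts of Federal State, that is, the mean sheltered ratio is the same for both. The results of the t-test are shown below.
Since the p-value is less than our significance level, we reject the null hypothesis: the MK and Federal State PIT ratios differ significantly, which suggests that MK strongly impacts Federal State PIT data.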
T-Test 2: Federal CoC and MK
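Our null hypothesis for the Federal CoC-MK test is that MK does not impact the homeless efforts of Federal CoC, that is, the mean sheltered ratio is the same for both. The results of the t-test are shown below.
Since the p-value is less than our significance level, we reject the null hypothesis: the MK and Federal CoC PIT ratios differ significantly, which suggests that MK strongly impacts Federal CoC PIT data.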
With visualizations and hypothesis testing, we have found strong evidence that MK has a sizable effect on sheltering the homeless in DC and that it contributes significantly to Federal CoC and Federal State homeless efforts.