
Measuring Spread in Data: Why and How?

Variance helps us understand how spread out our data are.


Utsav Chatterjee

3 years ago | 7 min read

In data science, some of the most important decisions about an analysis are made during exploratory data analysis. The Mean, Median and Mode help analysts get started with the basic structure of a data set, but they are only measures of central tendency and do not describe the data set as a whole. Range, Interquartile Range (IQR), Standard Deviation and Variance help us understand how spread out our data are.

This article focuses on:

  • Basics of the most commonly used measures of spread while analyzing data,
  • How to calculate these using Python

When we discuss measures of spread, we are considering numeric values that are associated with how far our points are from one another.

Common measures of spread include:

  • Range
  • Interquartile Range (IQR)
  • Standard Deviation
  • Variance

It is easiest to understand the spread of our data visually, and the most common visual for quantitative data is the histogram. To see how a histogram is constructed, suppose we have the following data set.

11, 12, 12, 14, 15, 17, 18, 19, 22, 25

First, we need to bin our data. It is up to the histogram creator to choose how the binning occurs. In this case, we can choose our bins as 11–14, 15–18, 19–22 and 23–26. Because the first 4 values are between 11 and 14, they go into the first bin. Similarly, 3 values lie between 15 and 18, so they go into the second bin, and so on.
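The bin counts described above can be checked with a short sketch. It assumes NumPy is available and uses `numpy.histogram`, which is not part of the original walkthrough; the edge list is one way to express the four bins for integer data.

```python
import numpy as np

data = [11, 12, 12, 14, 15, 17, 18, 19, 22, 25]

# Edges for the bins 11-14, 15-18, 19-22 and 23-26. Each bin includes its
# left edge and excludes its right edge, except the last, which includes both.
edges = [11, 15, 19, 23, 26]

counts, _ = np.histogram(data, bins=edges)
print(counts.tolist())  # [4, 3, 2, 1]
```

Each count is the height of the corresponding histogram bar.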

The number of values in each bin determines the height of each histogram bar. Changing the bins will change the visual accordingly. In most cases, software tools will choose appropriate bins for us. Our histogram, in this case, will look similar to the figure below.

We can also create the above histogram in Python using the matplotlib library:

# Importing libraries
import matplotlib.pyplot as plt

# Storing data in a list
data = [11, 12, 12, 14, 15, 17, 18, 19, 22, 25]

# Setting our bin edges in the form of a list
bins = [11, 15, 19, 23, 26]

# Plotting a very simple histogram and setting edge color to white
histogram = plt.hist(data, bins=bins, edgecolor='white')

# Viewing histogram
plt.show()

Histogram created using matplotlib library on Python

Consider the two histograms below, comparing the number of cars I saw passing by a cafe on weekdays and on weekends. If you look closely, the tallest bins for both weekdays and weekends are associated with 13 cars. This means that the number of cars I expect to see is the same on weekdays and weekends. The measures of center are also very similar: both have a mean, median and mode of about 13 cars. So what is different about these two distributions? They look different from each other in the histograms!

Comparison of the number of cars seen on weekdays and weekends.

The difference is how spread out the data are for each group. The number of cars I see on weekdays ranges from 10 to 16, while on weekends it ranges from 6 to 18.
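To make the point concrete, here is a minimal sketch with made-up counts. The two lists are hypothetical values chosen only to match the ranges quoted above; they are not the observations behind the histograms.

```python
import statistics

# Made-up counts for illustration only; these are NOT the author's observations.
weekdays = [10, 12, 13, 13, 13, 14, 16]
weekends = [6, 10, 13, 13, 13, 18, 18]

for label, cars in (("weekdays", weekdays), ("weekends", weekends)):
    print(label,
          "mean =", statistics.mean(cars),
          "median =", statistics.median(cars),
          "mode =", statistics.mode(cars),
          "range =", max(cars) - min(cars))
```

Both lists share the same mean, median and mode, yet the weekend range is twice the weekday range, which is exactly the kind of difference a measure of center cannot show.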

One of the most common ways to measure the spread of our data is to calculate the Five Number Summary, which consists of:

  1. Minimum: The smallest number in the dataset.
  2. Q1 (First Quartile): The value such that 25% of the data falls below.
  3. Q2 (Second Quartile): The value such that 50% of the data falls below, i.e., the Median.
  4. Q3 (Third Quartile): The value such that 75% of the data falls below.
  5. Maximum: The largest value in the dataset.

The Five Number Summary gives us the values needed to calculate the range and the interquartile range.

Consider the following data-set:

5, 8, 3, 2, 1, 3, 10

To calculate the Five Number Summary, the first thing we need to do is order our values, which gives us:

1, 2, 3, 3, 5, 8, 10

Once ordered, the minimum and maximum values are easy to identify. As we know, the median is the middle value in our dataset. We also call this Q2 or the second quartile because 50% of the data falls below this value. The remaining two values left to be calculated are Q1 and Q3. These values can be thought of as the medians of the data on either side of Q2.

So in this case, as the median is 3, the median of values to the left of Q2 will give us the value of Q1 (2) and the median of values to the right of Q2 will give us the value of Q3 (8).

If the data set has an even number of values, Q2 (the median) is the mean of the middle two values. Q1 is then the median of all values to the left of that midpoint, and Q3 is the median of all values to the right of it.

Once the Five Number Summary values have been computed, finding the Range and Interquartile Range is easy.

Range = Maximum - Minimum = 10 - 1 = 9
Interquartile Range = Q3 - Q1 = 8 - 2 = 6

The above calculations can be done using Python as shown below:

# Importing libraries
import numpy

input_data = [2, 1, 3, 3, 5, 10, 8, 9]

# Calculating the minimum value from the data
minimum = min(input_data)

# Calculating the maximum value from the data
maximum = max(input_data)

# Calculating the median using the numpy function
median = numpy.median(input_data)

# Calculating Q1 and Q3
# 1. Sort the input data list
# 2. Store the length of the sorted list in a variable
# 3. Find the location of the median to split the data into two lists
#    (for an even or odd number of elements)
# 4. Calculate Q1 and Q3 as the medians of the two lists
sorted_input_list = sorted(input_data)
length_sorted_input_list = len(sorted_input_list)

if length_sorted_input_list % 2 != 0:
    # Odd count: leave the middle value out of both halves
    index_of_median = (length_sorted_input_list + 1) // 2 - 1
    data_for_q1 = sorted_input_list[:index_of_median]
    data_for_q3 = sorted_input_list[index_of_median + 1:]
else:
    # Even count: split the list into two equal halves
    data_for_q1 = sorted_input_list[:length_sorted_input_list // 2]
    data_for_q3 = sorted_input_list[length_sorted_input_list // 2:]

q1 = numpy.median(data_for_q1)
q3 = numpy.median(data_for_q3)

# Printing final outputs
print("Minimum = " + str(minimum))
print("Maximum = " + str(maximum))
print("Median = " + str(median))
print("Q1 = " + str(q1))
print("Q3 = " + str(q3))
print("IQR = " + str(q3 - q1))

Output for Python code
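Library routines do not all use the median-of-halves rule described in this article, so their quartiles can differ from a hand calculation. The sketch below, which assumes Python 3.8+ and NumPy, compares two of them on the seven-value data set from earlier (5, 8, 3, 2, 1, 3, 10). For this particular data set, the default method of `statistics.quantiles` happens to agree with the hand-calculated Q1 = 2 and Q3 = 8, while the default linear interpolation of `numpy.percentile` does not.

```python
import statistics
import numpy as np

data = [5, 8, 3, 2, 1, 3, 10]

# statistics.quantiles uses the "exclusive" method by default
q1_a, q2_a, q3_a = statistics.quantiles(data, n=4)

# numpy.percentile interpolates linearly between sorted values by default
q1_b, q2_b, q3_b = np.percentile(data, [25, 50, 75])

print("statistics:", q1_a, q3_a, "IQR =", q3_a - q1_a)
print("numpy:     ", q1_b, q3_b, "IQR =", q3_b - q1_b)
print("Range =", max(data) - min(data))
```

Here `numpy.percentile` gives Q1 = 2.5 and Q3 = 6.5, so its IQR is 4 rather than 6. When comparing IQR values from different tools, it helps to check which quartile convention each one uses.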

The most common way that professionals measure the spread of a data-set with a single value is with the Standard Deviation or Variance.
The Standard Deviation tells us on average how far every data point is from the mean of the points.

Imagine we wanted to know how far students live from their school. One student might live 15 km away, another 35 km, another only 1 km and another 60 km. We could aggregate all of these distances to show that the average distance (mean) between students and the school is 27.75 km.

Now, if we wanted to know how the distance to school varies from one student to another, we could use the Five Number Summary as a description. However, if we wanted just one number to talk about the spread, we would choose the Standard Deviation. In this case, Student 1 is about 13 km closer to the school than the average, while Student 2 is about 7 km farther from the school than the average. The Standard Deviation is how far, on average, these students are located from the mean distance.
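A quick sketch of this example follows. The variable names are illustrative, and `statistics.pstdev` is used on the assumption that the four students are treated as the whole population rather than a sample.

```python
import statistics

distances_km = [15, 35, 1, 60]

mean_km = sum(distances_km) / len(distances_km)
deviations = [d - mean_km for d in distances_km]

print(mean_km)     # 27.75
print(deviations)  # [-12.75, 7.25, -26.75, 32.25]

# Population standard deviation: one number summarizing those deviations
spread_km = statistics.pstdev(distances_km)
print(round(spread_km, 1))
```

The standard deviation comes out at roughly 22 km, which is large relative to the 27.75 km mean and reflects how differently placed these four students are.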

This gives us a conceptual idea of what we are trying to measure using standard deviation.

To understand how to calculate the standard deviation, let us consider the following data set. It has just 4 elements:

10, 14, 10, 6

Step 1: Calculate the Mean

Mean = (10 + 14 + 10 + 6) / 4 = 40/4 = 10

Step 2: Calculate the distance of each observation from the calculated mean

10 - 10 = 0
14 - 10 = 4
10 - 10 = 0
6 - 10 = -4

Step 3: Two of the observations are equal to the mean, so the distance for these is 0. For the other two observations, one of the values is 4 larger (14) and the other is 4 smaller (6). Now, if we were to find the average of these distances, we would get the value 0, which is not a good measure of spread. This could lead to confusion, as zero could suggest that all the values are the same, or there is no spread.

(0 + 4 + 0 + (-4))/4 = 0
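This cancellation is not special to these four numbers: positive and negative distances from the mean always balance out. A small sketch, using a hypothetical helper named `mean_deviation`, checks it on this data set and on the histogram data from earlier.

```python
def mean_deviation(values):
    """Average signed distance of each value from the mean."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values) / len(values)

print(mean_deviation([10, 14, 10, 6]))
print(mean_deviation([11, 12, 12, 14, 15, 17, 18, 19, 22, 25]))
```

Both calls return zero (in general, floating-point rounding may leave a tiny nonzero remainder), even though the second data set is clearly more spread out than the first.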

So instead, we square each distance, which removes the negative signs so that the values can no longer cancel out.

(10 - 10)² = 0
(14 - 10)² = 16
(10 - 10)² = 0
(6 - 10)² = 16

Step 4: The average of these values will give us the average squared distance of each observation from the mean, also known as the Variance.

Variance = (0 + 16 + 0 + 16)/4 = 32/4 = 8

Step 5: However, this is an average of values which we only squared to get positive values in the first place. So, to get our Standard Deviation, we take the square root of this ending value.

Standard Deviation = √8 ≈ 2.83

This can be done in Python as follows:

# Importing libraries
import numpy as np

# Storing input data into a list
input_data = [10, 14, 10, 6]

# Computing variance
print("Variance = ", np.var(input_data))

# Computing standard deviation
print("Standard Deviation = ", np.std(input_data))
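The steps in this article divide by the number of observations, which is the population variance, and that is also NumPy's default. Many tools instead divide by n - 1 when the data are a sample from a larger population. The sketch below traces the hand calculation and shows both conventions; the `ddof` parameter of `np.var` and `np.std` is what switches between them.

```python
import numpy as np

data = [10, 14, 10, 6]

# Hand calculation: sum of squared distances from the mean
mean = sum(data) / len(data)
sum_of_squares = sum((x - mean) ** 2 for x in data)   # 0 + 16 + 0 + 16 = 32

population_var = sum_of_squares / len(data)           # 32 / 4 = 8, as in Step 4
sample_var = sum_of_squares / (len(data) - 1)         # 32 / 3

# NumPy equivalents: ddof=0 (default) divides by n, ddof=1 divides by n - 1
print(population_var, np.var(data))
print(sample_var, np.var(data, ddof=1))
print(np.std(data), np.std(data, ddof=1))
```

With only four values the two conventions differ noticeably; the gap shrinks as the data set grows.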

Summary

  • The Five Number Summary consists of Minimum, Q1 (First Quartile), Q2 (Second Quartile), Q3 (Third Quartile) and Maximum.
  • The Variance is used to compare the spread of two different groups. A set of data with higher variance is more spread out than a dataset with a lower variance.
  • The Standard Deviation gives us a single number for comparing the spread of two data sets. Having this single value also reduces the amount of information we need to consume. For example, the standard deviation is used to analyze risk in the finance industry, to help determine the significance of drug effects in medical studies, and to measure the error of predictions, from the amount of rainfall we can expect tomorrow to your predicted commute time.

This article was originally published by Utsav Chatterjee on Medium.
