Storytelling with Histograms, Tips & Extensions

A histogram is a graphical representation of the distribution of a dataset.


Darío Weitz

3 years ago | 7 min read

Image by Patrick Tomasso from Unsplash

Why: a histogram is a graphical representation of the distribution of a dataset. Although its appearance is similar to that of a standard bar graph, instead of making comparisons between different items or categories or showing trends over time, a histogram is a plot that lets you show the underlying frequency distribution or the probability distribution of a single continuous numerical variable.

Let me clarify that a probability distribution indicates all the possible values that a certain random variable can take, plus a summary of the probabilities for those values. Put simply, a continuous numerical variable is one that can take an unlimited number of values within a range or interval, for example height, weight, age, or temperature.

How: a histogram has two axes; the vertical axis shows frequency, while the horizontal axis is divided into consecutive ranges of numeric values (intervals or bins) or time intervals.

The frequency of each bin is shown by the area of vertical rectangular bars. Each bar covers a range of continuous numeric values of the variable under study. The vertical axis shows frequency values derived from counts for each bin.

The midpoint value is the one that gives the name to the interval. When a numerical value falls exactly on one of the boundaries of an interval, it is assigned to the left or right interval according to the default setting of the visualization tool. Some tools let you change this default to suit your preferences or needs.
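How a given tool treats boundary values is easy to check directly. The following is a minimal sketch, assuming NumPy and a handful of made-up values (they are not taken from any figure in this article). NumPy's `np.histogram` uses bins that are closed on the left and open on the right, except for the last bin, which also includes its right edge; `np.digitize` with `right=True` is used here to show the opposite convention.

```python
import numpy as np

# Made-up values; 10 and 20 sit exactly on bin boundaries.
values = np.array([0, 5, 10, 10, 15, 20])
edges = np.array([0, 10, 20])  # two bins: 0-10 and 10-20

# np.histogram: bins are [left, right), except the last one, which is
# [left, right] so that the maximum edge is not dropped.
counts, _ = np.histogram(values, bins=edges)

# Opposite convention: a value on an inner boundary goes to the bin on its left.
inner_edges = edges[1:-1]
right_closed = np.bincount(np.digitize(values, inner_edges, right=True), minlength=2)

# Midpoints, often used to name each interval.
midpoints = (edges[:-1] + edges[1:]) / 2

print(counts, right_closed, midpoints)
```

With the first convention the two 10s are counted in the second bin; with the second convention they are counted in the first bin.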

Histograms sometimes have bars of unequal width. However, it is usual to plot them with the same width in order to represent equal ranges of data for each interval.

As a counterexample, consider the following case: collect data from individuals in a population and split the data into bins of 10-year age ranges, but accumulate the data from people over 75 years old in a single interval. When the binwidth is the same for all intervals, bar height is proportional to bar area, so the height alone can be read as the frequency.
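The relationship between area and height with unequal bins can be sketched numerically. The example below assumes NumPy; the bin edges and frequencies are invented for illustration and only loosely follow the age example above. The bar height is computed as relative frequency divided by binwidth, so that the area of each bar equals its relative frequency.

```python
import numpy as np

# Hypothetical age bins: 10-year ranges, plus one wide bin for ages over 75.
edges = np.array([15, 25, 35, 45, 55, 65, 75, 105])
counts = np.array([40, 60, 50, 30, 10, 6, 4])  # made-up frequencies

widths = np.diff(edges)
rel_freq = counts / counts.sum()

# Density height: relative frequency per unit of the horizontal axis.
heights = rel_freq / widths

# The area of each bar recovers the relative frequency of its bin.
areas = heights * widths

print(heights.round(4))
print(areas.sum())
```

Plotting the raw counts as heights would make the wide last bin look more important than it is; dividing by the binwidth removes that distortion. As far as I know, `np.histogram(..., density=True)` applies the same normalization.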

Schematic Diagram:

Histogram created with Matplotlib

Regarding the nature of the message, a histogram shows the frequency distribution that is used (or should be used) to represent the probability distribution of a single continuous quantitative variable. Conceptually, it is not the height but the area of the rectangle which is proportional to the frequency of each interval, which in turn is related to the probability of each range of values in which the continuous variable is divided.

Remember that a frequency distribution is a representation that displays the frequency (how many times) of the occurrence of the values of a variable in a given dataset.

To illustrate the idea: a medium-size neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the pattern of the distribution. The overall shape of the distribution is skewed to the right with a clear mode of around $25.

It also has a second, smaller peak around the $50-55 interval. Although most customers spend around $25 per purchase, there is a second, smaller group of customers spending around $50 per visit.

Source: #1

Storytelling: a histogram is an appropriate graph for the initial exploration of a continuous variable. By means of a set of vertical bars, it shows how the numerical values of that variable are distributed.

The histogram allows estimating the probability that the continuous variable under study falls within any given range of values, which is of great importance if we want to make inferences and estimate population values from the results of our sample.

A histogram provides a visual representation of the distribution of a dataset: the location, spread, and skewness of the data. It helps to visualize whether the distribution is symmetric or skewed left or right, and whether it is unimodal, bimodal, or multimodal.

It can also show any outliers or gaps in the data. In brief, a histogram summarizes the distribution properties of a continuous numerical variable.

At this point it is important to clarify some statistical terms: a population is the complete set of elements that make up the object under study; the broader group of people, cars, things, dollars spent, etc., to which you intend to generalize the results of your study.

A sample is a subset of the entire population, obtained through some sampling method and selected to represent that population. Examples: the results of a survey or the data collected in the convenience store.

The mode is a measure of central tendency representing the value that occurs most frequently in a dataset. A unimodal distribution is a distribution possessing a unique mode or one single peak; bimodal if it presents two peaks or modes; multimodal if it presents two or more peaks. The following figure shows representations of distributions skewed left or right, symmetric, uniform, or multimodal.

Source: #2
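The shape terms above can also be checked numerically before looking at the plot. The sketch below assumes NumPy; the helper names `sample_skewness` and `modal_bin` are hypothetical, and the two small samples are invented. It computes the third standardized moment as a rough indicator of the direction of the skew, and locates the bin with the highest count.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: > 0 suggests a right tail, < 0 a left tail."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

def modal_bin(values, bins=10):
    """Return the (left, right) edges of the bin with the highest count."""
    counts, edges = np.histogram(values, bins=bins)
    i = int(np.argmax(counts))  # first bin wins in case of a tie
    return edges[i], edges[i + 1]

right_skewed = [1, 1, 1, 1, 2, 2, 3, 10]  # made-up sample with a long right tail
symmetric = [1, 2, 3, 4, 5]               # made-up symmetric sample

print(sample_skewness(right_skewed), sample_skewness(symmetric))
print(modal_bin(right_skewed, bins=[0, 2, 4, 12]))
```

A single number cannot replace the picture: a bimodal sample can have a skewness near zero, which is one reason to plot the histogram anyway.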

Tips for histograms

· Always start the vertical axis baseline at 0. As the distribution is displayed by the height of the rectangles (for equal binwidths), we inevitably distort the visual if we modify the baseline;

· There are no strictly defined rules for the size and number of intervals. Always try out a few different values of binwidth. Although the visualization tools include their selection criteria for these parameters, it is essential to experiment with other values.

The number of intervals may be suggested by the nature of the dataset or by observing how strongly the visual message changes with their size. Always keep in mind that too few intervals hide the fine structure of the data distribution, while too many intervals give undue weight to sampling error.

Source: #3

· Although it is preferable to use the same binwidth for all intervals, it sometimes happens that there are very few numerical values for certain bins, particularly at the extremes. Remember the previous counterexample about people over 75 years old.

In these cases, accumulate these sparse values in a wider interval. Include in the graph additional information clearly indicating that change.
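Two of the tips above can be tried out in a few lines. The sketch below assumes NumPy and a synthetic right-skewed sample (not real data). It first compares several fixed bin counts with two of NumPy's built-in selection rules (`"sturges"` and `"fd"`), and then merges sparse bins at the upper extreme into one wider interval using a hypothetical helper, `merge_sparse_tail`.

```python
import numpy as np

def merge_sparse_tail(edges, counts, min_count):
    """Merge bins from the right until the last bin holds at least min_count."""
    edges, counts = list(edges), list(counts)
    while len(counts) > 1 and counts[-1] < min_count:
        last = counts.pop()
        counts[-1] += last   # fold the sparse bin into its left neighbor
        edges.pop(-2)        # drop the boundary between the two bins
    return np.array(edges), np.array(counts)

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=12.0, size=500)  # synthetic sample

# Same data, different binnings: number of bins and binwidth for each choice.
summary = {}
for bins in (5, 20, 80, "sturges", "fd"):
    edges = np.histogram_bin_edges(data, bins=bins)
    counts, _ = np.histogram(data, bins=edges)
    summary[bins] = (len(counts), edges[1] - edges[0])
    print(f"{bins!s:>8}: {len(counts):3d} bins of width {edges[1] - edges[0]:.2f}")

# Made-up counts with two sparse bins at the upper extreme.
merged_edges, merged_counts = merge_sparse_tail(
    [0, 10, 20, 30, 40], [5, 8, 1, 1], min_count=3
)
print(merged_edges, merged_counts)
```

Once bins have been merged the binwidths are no longer equal, so the bars should be drawn as densities (frequency divided by binwidth) and the change should be stated on the graph.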

Differences between histograms (HG) and standard bar charts (BC)

Standard bar charts are used to make numerical comparisons amongst categories whilst histograms are used to show the frequency distribution of a dataset;

BCs plot categories (discrete qualitative elements) while HGs graph quantitative data grouped into intervals;

There are no "gaps" or spaces between the bars of a histogram; it is mandatory to leave some space between the bars on a BC to clearly indicate that it refers to discrete (mutually exclusive) groups.

BCs can be realigned (ascending, descending, alphabetical, etc.); HGs can’t be reordered because they have an intrinsic ordering;

HGs display areas whilst BCs display lengths. In an HG both the horizontal and the vertical axis have numerical values, so an area can be calculated. In a BC as one axis shows categories, there is no way to calculate areas;

All bars in a BC must have the same width. Histograms may have bars with different widths.

Observation: the word histogram derives from the Greek: histos means “anything set upright” and gramma means drawing or writing.

Extension 1: Overlapping Histograms

Overlapping histograms are used to compare the frequency distribution of a continuous variable in two or more categories. Be very cautious, because more than two histograms on the screen might confuse the audience.

Overlapped Histograms created with Matplotlib
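A minimal sketch of how such a figure could be built is shown below. It is not the code behind the figure above; the function names `shared_edges` and `plot_overlapping` are hypothetical, and the plotting function assumes it receives a Matplotlib `Axes` object. The two choices that matter are a common set of bin edges for every group and semi-transparent bars.

```python
import numpy as np

def shared_edges(groups, n_bins=20):
    """Common bin edges spanning every group, so bars line up across groups."""
    lo = min(np.min(g) for g in groups)
    hi = max(np.max(g) for g in groups)
    return np.linspace(lo, hi, n_bins + 1)

def plot_overlapping(ax, groups, labels, n_bins=20):
    """Draw semi-transparent histograms of each group on a Matplotlib Axes."""
    edges = shared_edges(groups, n_bins)
    for values, label in zip(groups, labels):
        # alpha < 1 keeps the bars of the other group visible underneath
        ax.hist(values, bins=edges, alpha=0.5, label=label)
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
    ax.legend()
    return edges
```

To draw it, create an axes with `fig, ax = plt.subplots()` (after `import matplotlib.pyplot as plt`), call `plot_overlapping(ax, [group_a, group_b], ["A", "B"])`, and finish with `plt.show()`. If each group were binned separately, the bars would not share boundaries and the comparison would be misleading.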

Extension 2: Frequency Polygons

A frequency polygon is a graph derived from a typical histogram. It consists of connected line segments formed by joining the midpoints of the upper edges of the histogram’s bars. All bars of the underlying histogram must have the same width.

Source: #4

Frequency polygons are used as an alternative to overlapping histograms to compare simultaneously two or more frequency distributions. The usual procedure (as shown in the following figure) is to erase the bars that give rise to the histograms and leave only the resulting polygons.
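The construction can be sketched as follows, assuming NumPy and equal-width bins. The function name `frequency_polygon` and the sample values are made up for illustration. Closing the polygon with a zero-frequency point half a bin beyond each end is a common convention, not something the text above requires.

```python
import numpy as np

def frequency_polygon(values, edges):
    """Return the (x, y) vertices of a frequency polygon for equal-width bins."""
    counts, edges = np.histogram(values, bins=edges)
    width = edges[1] - edges[0]              # assumes all bins share this width
    mids = (edges[:-1] + edges[1:]) / 2      # midpoints of the bars' upper edges
    # Add one zero-count vertex on each side so the polygon touches the axis.
    x = np.concatenate(([mids[0] - width], mids, [mids[-1] + width]))
    y = np.concatenate(([0], counts, [0]))
    return x, y

x, y = frequency_polygon([1, 2, 2, 3, 3, 3, 5], edges=[0, 2, 4, 6])
print(x, y)
```

The returned vertices can be passed to any line-plotting call, for example `ax.plot(x, y)` on a Matplotlib axes; calling it once per group gives the overlaid polygons described above.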

Extension 3: Density Plots

AKA: Kernel Density Plots, Kernel Density Estimation, Density Trace Graphs

It is a “natural” extension of the histogram and uses the same numerical values for its development. Density plots attempt to show the probability density function of the data set by means of a continuous curve. With that goal in mind, density graphs apply a statistical procedure (kernel density estimation) with the idea of smoothing the rectangular bars that characterize the histogram. As a result, a smooth curve is obtained that allows a better visualization of the shape of the distribution.

Comparison of the histogram (left) and kernel density estimation (right) constructed using the same data. The 6 individual kernels are the red dashed curves; the blue curve is the kernel density estimation. The data points are the rug plot on the horizontal axis. Source: #5

A kernel is a symmetric weighting function centered on each data point. The density estimation method adds up the contributions of all the kernels and generates a smooth curve that represents the final estimate of the density. The three most used kernel functions are Gaussian, Uniform, and Epanechnikov.

Density plots are two-dimensional plots with two axes: the vertical axis is a density axis while the horizontal axis is a numerical one. Density curves are usually scaled such that the area under the curve equals one. The peaks of the curves indicate where the values of the dataset under study are concentrated.
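To make the procedure concrete, here is a minimal sketch of a kernel density estimate written directly with NumPy, assuming a Gaussian kernel. The function name `gaussian_kde_curve`, the six observations, and the bandwidth are all made up for illustration. Each data point contributes one kernel; the estimate is the average of those kernels, and the area under the resulting curve is approximately one.

```python
import numpy as np

def gaussian_kde_curve(data, grid, bandwidth):
    """Average of Gaussian kernels centered on each data point, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    # z[i, j]: distance from grid point i to data point j, in bandwidth units
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * bandwidth)

data = [1.0, 2.0, 2.5, 4.0, 7.0, 7.5]   # six made-up observations
grid = np.linspace(-5, 14, 2001)
density = gaussian_kde_curve(data, grid, bandwidth=1.0)

# Riemann-sum approximation of the area under the curve.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 4))
```

In practice a library routine would normally be used instead of hand-written code; the point here is only to show what the smooth curve is made of.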

Source: #6

The key idea in density plots is to eliminate the jaggedness that characterizes histograms (do not forget to compare the figures). To do this, the method "induces" overlapping between the histogram’s adjacent intervals or bins. The resulting smoothed version of the histogram estimates the probability density function of the variable under study.

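Remember that the probability density function describes the relative likelihood that a continuous random variable takes a value near a given point; the probability of falling within an interval is the area under the curve over that interval.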

The appearance of the density plot depends on two parameters: the kernel function and the bandwidth. The bandwidth plays a role equivalent to that of the binwidth in the histogram. Always try out a few different values of the bandwidth, for the same reasons given for the size of the intervals in the histogram. It is also advisable to experiment with different kernel functions.
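The effect of the bandwidth can be seen by counting the peaks of the estimated curve. The sketch below assumes NumPy, a Gaussian kernel, and a synthetic bimodal sample (invented for illustration; it is not the convenience store data). The helpers `kde` and `count_peaks` are hypothetical names.

```python
import numpy as np

def kde(data, grid, h):
    """Gaussian kernel density estimate with bandwidth h, evaluated on grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def count_peaks(y):
    """Number of strict local maxima in a sampled curve."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

rng = np.random.default_rng(1)
# Synthetic sample: a large group around 25 and a smaller one around 52.
data = np.concatenate([rng.normal(25, 5, 150), rng.normal(52, 4, 50)])
grid = np.linspace(0, 80, 801)

# Very small, moderate, and very large bandwidths on the same data.
peaks = {h: count_peaks(kde(data, grid, h)) for h in (0.3, 3.0, 30.0)}
print(peaks)
```

A very small bandwidth yields a jagged curve with many spurious peaks, much like a histogram with too many bins, while a very large one flattens the two groups into a single bump.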

Two final warnings: a) density plots require a significant amount of data for the smoothed curve to be truly representative of the underlying distribution; b) they have a tendency to produce the appearance of data where none exists, in particular in the tails. As a consequence, the careless use of density estimates can easily lead to figures that make nonsensical statements (Source: #7).
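Warning b) can be illustrated with a quantity that cannot be negative. The sketch below assumes NumPy, a Gaussian kernel with an arbitrarily chosen bandwidth, and a synthetic sample of non-negative amounts. It measures how much of the estimated density falls below zero, where no data can exist.

```python
import numpy as np

rng = np.random.default_rng(2)
spend = rng.exponential(scale=10.0, size=300)  # synthetic amounts, never negative

h = 3.0                             # bandwidth, chosen arbitrarily for the example
grid = np.linspace(-20, 0, 401)     # region where no observation can fall

# Gaussian kernel density estimate evaluated only on the negative side.
z = (grid[:, None] - spend[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(spend) * h * np.sqrt(2 * np.pi))

# Approximate share of the estimated probability assigned to negative values.
mass_below_zero = np.sum(density) * (grid[1] - grid[0])
print(round(mass_below_zero, 3))
```

A density plot of this sample would therefore suggest that some customers spend negative amounts, which is exactly the kind of nonsensical statement the warning refers to.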









