Confirmation Bias Is The Enemy Of Exploratory Data Analysis
Humans tend to look for proof of what they already believe, and this can have devastating effects
Between the 16th and 19th centuries, in Western Europe, tens of thousands of women were executed during witch-hunts. Because identifying a witch was difficult, special tests were used to determine whether or not a woman was one.
One such test involved throwing the woman into water with her hands tied behind her back. If she floated, she was a witch, assumed to have been saved by Satan, and was sentenced to death. If she drowned, she was innocent.
Although the link between witch-hunting and data analytics may not be immediately clear, they are both subject to a cognitive bias known as confirmation bias.
Confirmation bias occurs when a person searches for, or interprets, information to conform with their prior beliefs.
Witch-hunting has often been used as an example of confirmation bias. There is no practical value in proving that a woman is innocent of witchcraft if she dies in the process. But that wasn’t the point of the test. Instead, it was designed purely as a method of confirming the prior belief: guilt.
Exploratory Data Analysis
Exploratory data analysis (EDA) is the initial investigation of data, usually using statistics and graphical representations, to summarise and understand it.
In essence, EDA is the process of answering questions about the data. Some of these may be very simple:
- How many rows/columns does the data have?
- What are the column types?
- Is there any data missing?
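As a rough illustration, the three questions above can be answered in a few lines of plain Python. The records, column names and values below are made up for the example and `None` is assumed to mark a missing value. With pandas the same checks are usually done with `df.shape`, `df.dtypes` and `df.isna().sum()`.

```python
# Hypothetical box office records (invented for illustration).
# None is assumed to mark a missing value.
rows = [
    {"date": "2024-03-01", "cinema": "Central", "tickets": 120},
    {"date": "2024-03-02", "cinema": "Central", "tickets": 310},
    {"date": "2024-03-03", "cinema": None, "tickets": 280},
    {"date": "2024-03-04", "cinema": "Riverside", "tickets": None},
]

columns = list(rows[0])

# How many rows/columns does the data have?
shape = (len(rows), len(columns))

# What are the column types? (types seen among the non-missing values)
column_types = {
    col: sorted({type(r[col]).__name__ for r in rows if r[col] is not None})
    for col in columns
}

# Is there any data missing? (count of missing values per column)
missing = {col: sum(r[col] is None for r in rows) for col in columns}

print(shape)
print(column_types)
print(missing)
```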
However, as EDA progresses, the questions become more complex. Let’s take the example of a dataset that shows box office information for a new film. One question might be:
Are there more ticket sales on weekends?
You may notice that I have written this question so that it can be answered with a simple “yes” or “no”. This may seem trivial, but it is here that bias can start to creep in. I could have written the question differently:
Do more ticket sales occur on particular days of the week?
Instead of devising a question to ascertain the distribution of ticket sales through the week, I have asked a question with the purpose of validating my existing belief, that more tickets are sold during the weekend.
The context may be a mile off, but the logic is the same as that of the witch-tests, which were devised to confirm that a woman was a witch, not to find out whether she was one.
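To make the difference between the two questions concrete, here is a minimal sketch using a single week of invented daily sales (the dates and numbers are assumptions, not real box office data). The closed question collapses the week into one yes/no answer, while the open question keeps the whole distribution in view.

```python
from datetime import date
from statistics import mean

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical ticket sales for one week (made-up numbers).
sales = {
    date(2024, 3, 4): 150,   # Monday
    date(2024, 3, 5): 420,   # Tuesday
    date(2024, 3, 6): 160,   # Wednesday
    date(2024, 3, 7): 170,   # Thursday
    date(2024, 3, 8): 260,   # Friday
    date(2024, 3, 9): 330,   # Saturday
    date(2024, 3, 10): 300,  # Sunday
}

# Closed question: "Are there more ticket sales on weekends?"
# date.weekday() returns 0 for Monday through 6 for Sunday.
weekend = [n for d, n in sales.items() if d.weekday() >= 5]
weekday = [n for d, n in sales.items() if d.weekday() < 5]
weekend_sells_more = mean(weekend) > mean(weekday)

# Open question: "Do more ticket sales occur on particular days of the week?"
by_day = {DAY_NAMES[d.weekday()]: n for d, n in sales.items()}
busiest_day = max(by_day, key=by_day.get)

print(weekend_sells_more)
print(by_day)
print(busiest_day)
```

In this invented week the closed question is answered with "yes", since the average weekend day outsells the average weekday, yet the busiest single day is a Tuesday. That detail only shows up when the question asks for the distribution rather than a confirmation.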
The Problem of Bias
This may seem like a trivial point. At the end of the day, you will end up determining whether weekends sell more tickets or not.
The problem isn’t the analysis in and of itself, but what happens afterwards. By its very nature, EDA is meant to be a stepping stone to another question: so what?
We use the information discovered during EDA to drive change; perhaps adapt an existing process or create a new one. But change by its very nature involves the upheaval of prior beliefs.
If confirmation bias stops people from searching for evidence that contravenes their prior beliefs, then no amount of EDA is going to drive meaningful change.
What would happen instead?
Imagine we found out ticket sales were higher during the week. Instead of acting on this (maybe increasing advertising during the weekend to boost weekend ticket sales), many people would jump down the rabbit hole of explainability.
Using further (often contrived) analysis, they would try to explain why their belief is right despite the evidence. Then, instead of being a vehicle for change, EDA becomes an ego boost for an invalid theory.
Mitigating Confirmation Bias
Humans have a natural tendency towards confirmation bias, and we often can’t tell that we are doing it. Therefore, we need to take active and purposeful measures to mitigate it.
Small changes, such as reframing the question as above, can make a large difference, because an open question no longer assumes the answer we expect.
None of this is to say that we should completely disregard our prior beliefs. We usually have them for a reason, and if there is evidence to the contrary we should question that evidence. Rather, we need to make sure that when we do question it, we do so from a fair and balanced perspective.
This is often something that is very difficult to do, so another technique to avoid confirmation bias is to get a second opinion. Ask someone else to look at the evidence and see what they think.
EDA is a powerful tool founded in data. But data not only contains inherent bias; it is also biased by the people analysing it. It is our responsibility to understand where and when we might have biases and to mitigate them.
Data Science Consultant, NLP enthusiast, Physics graduate https://medium.com/@jonnyndavis