Big data is suddenly everywhere. Where data (and information) used to be scarce and hard to find, we now have a deluge of it. In recent years, the amount of available data has been growing at an exponential pace. This has been made possible by the immense growth in the number of devices recording data, as well as the connectivity between all these devices through the internet of things. Everyone seems to be collecting, analyzing, making money from and celebrating (or fearing) the powers of big data. Combined with the power of modern computing, it promises to solve virtually any problem, just by crunching the numbers.

But can big data really deliver on all this hype? In some cases, yes; in others, maybe not. There is no doubt that big data has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence solution involves some serious number crunching.

The first thing to note, however, is that although AI is currently very good at finding patterns and relationships within big datasets, it is still not very intelligent (depending on your definition of intelligence, but that’s another story!). Crunching the numbers can effectively identify subtle patterns in our data, but it cannot directly tell us which of those correlations are actually meaningful.

Correlation vs. Causation

We all know (or should know!) that “Correlation doesn’t imply causation”. However, the human mind is hardwired to look for patterns, and when we see lines sloping together and apparent patterns in our data, it is hard for us to resist the urge to assign a reason.

Statistically we can’t make that leap, however. Tyler Vigen, the author of Spurious Correlations, has made sport of this on his website (which I can very much recommend visiting for a look at some entertaining statistics!). Some examples of such spurious correlations can be found in the figures below, where I have collected a few examples showing how ice cream is apparently causing a lot of bad things, ranging from forest fires to shark attacks and polio outbreaks.

Having a look at these plots, one could argue that we should probably have banned ice cream a long time ago. And, actually, in the 1940s polio example, public health experts did recommend that people stop eating ice cream as part of an “anti-polio diet”. Fortunately, they eventually came to realize that the correlation between polio outbreaks and ice cream consumption arose simply because polio outbreaks were most common during summer, which is also when people eat the most ice cream.

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a “common response variable”, “confounding factor”, or “lurking variable”). An example involving such a “lurking variable” is the apparent correlation between ice cream sales and shark attacks (I feel quite confident that increased sales of ice cream do not cause sharks to attack people). However, there is a common link behind these two numbers, namely temperature. Higher temperatures cause more people to buy ice cream as well as more people to go for a swim. Thus, this “lurking variable” is the real cause of the apparent correlation. Luckily, we have learned to separate correlation from causation, and we can still enjoy some ice cream on a hot summer day without fearing polio outbreaks and shark attacks!
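To make the lurking-variable idea concrete, here is a minimal sketch using NumPy. All numbers are made up for illustration (the temperature range, the slopes and the noise levels are my assumptions, not real sales or attack data). Both series are generated from temperature only, never from each other, and the correlation is computed twice: once on the raw series, and once after removing the linear effect of temperature from both.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 365

# Hypothetical daily temperature (the lurking variable), in degrees Celsius.
temperature = rng.uniform(5, 35, size=n_days)

# Both toy series depend on temperature plus independent noise,
# and never on each other. Units and scales are arbitrary.
ice_cream_sales = 20.0 * temperature + rng.normal(0, 60, size=n_days)
shark_attacks = 0.3 * temperature + rng.normal(0, 1, size=n_days)


def residuals(values, control):
    """Return what is left of `values` after removing a linear fit on `control`."""
    slope, intercept = np.polyfit(control, values, deg=1)
    return values - (slope * control + intercept)


# Correlation of the raw series: strong, because both follow temperature.
raw_corr = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

# Correlation once temperature is accounted for (a partial correlation).
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:               {raw_corr:.2f}")
print(f"correlation given temperature: {partial_corr:.2f}")
```

In this toy setup the raw correlation should come out strong, while the correlation that remains after controlling for temperature should be close to zero. Of course, this only works because we knew which variable to control for, which is exactly the kind of understanding the numbers cannot supply on their own.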

The power and limits of correlations

With enough data, computing power and statistical algorithms, patterns will be found. But are these patterns of any interest? Not all of them will be, as spurious patterns could easily outnumber the meaningful ones. Big data combined with algorithms can be an extremely useful tool when applied correctly to the right problems. However, no scientist thinks you can solve a problem by crunching data alone, no matter how powerful the statistical analysis. You should always start your analysis from an underlying understanding of the problem you are trying to solve.

Data science is the end of science (or is it?)

In June 2008, C. Anderson, former editor-in-chief of Wired Magazine, wrote a provocative essay titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, in which he states that “with enough data, the numbers speak for themselves” and that “correlation supersedes causation, and science can advance even without coherent models and unified theories”.

The strength and generality of this approach rely on the amount of data: the more data, the more powerful and effective the method based on computationally discovered correlations becomes. We can simply feed the numbers into powerful computers and let statistical algorithms automatically find interesting patterns and insights.

Unfortunately, this simplified approach to analysis has some potential pitfalls, which are nicely illustrated by an example from the blog of John Poppelaars:

“Suppose we would like to create a prediction model for some variable Y. This could for example be the stock price of a company, the click-through rates of online ads or next week’s weather. Next we gather all the data we can lay our hands on and put it in some statistical procedure to find the best possible prediction model for Y. A common procedure is to first estimate the model using all the variables, screen out the unimportant ones (the ones not significant at some predefined significance level), re-estimate the model with the selected subset of variables, and repeat this procedure until a significant model is found. Simple enough, isn’t it?

“Anderson’s suggested way of analysis has some serious drawbacks, however. Let me illustrate. Following the above example, I created a set of data points for Y by drawing 100 samples from a uniform distribution between zero and one, so it’s random noise. Next I created a set of 50 explanatory variables X(i) by drawing 100 samples from a uniform distribution between zero and one for each of them. So, all 50 explanatory variables are random noise as well. I estimate a linear regression model using all X(i) variables to predict Y. Since nothing is related (all variables are uniformly distributed and independent), an R squared of zero is expected, but in fact it isn’t. It turns out to be 0.5. Not bad for a regression based on random noise! Luckily, the model is not significant. The variables that are not significant are eliminated step by step and the model re-estimated. This procedure is repeated until a significant model is found. After a few steps a significant model is found with an Adjusted R squared of 0.4 and 7 variables at a significance level of at least 99%. Again, we are regressing random noise, there is absolutely no relationship in it, but still we find a significant model with 7 significant parameters. This is what would happen if we just feed data to statistical algorithms to go find patterns.”
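The experiment in the quote is easy to try yourself. What follows is my own rough sketch of it in NumPy, not Poppelaars’ original code. The helper names (`fit_ols`, `backward_eliminate`), the seed, and the stopping rule are assumptions on my part: I drop the variable with the smallest absolute t-statistic until every remaining one exceeds 2.6, which is roughly the two-sided 1% critical value for these sample sizes. As a sanity check on the surprising 0.5, a regression with an intercept and p pure-noise predictors on n observations has an expected R squared of roughly p / (n - 1), which here is 50 / 99, or about 0.51.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (R squared, t-statistics of the slope coefficients).
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    rss = resid @ resid                       # residual sum of squares
    tss = ((y - y.mean()) ** 2).sum()         # total sum of squares
    dof = n - design.shape[1]                 # residual degrees of freedom
    cov = rss / dof * np.linalg.inv(design.T @ design)
    t_stats = coef / np.sqrt(np.diag(cov))
    return 1 - rss / tss, t_stats[1:]         # skip the intercept's t-statistic


def backward_eliminate(X, y, t_crit=2.6):
    """Drop the weakest variable until every remaining |t| is at least t_crit."""
    kept = list(range(X.shape[1]))
    while kept:
        _, t_stats = fit_ols(X[:, kept], y)
        weakest = int(np.argmin(np.abs(t_stats)))
        if abs(t_stats[weakest]) >= t_crit:
            break
        kept.pop(weakest)
    return kept


rng = np.random.default_rng(42)
n_samples, n_vars = 100, 50
y = rng.uniform(size=n_samples)               # pure noise target
X = rng.uniform(size=(n_samples, n_vars))     # pure noise predictors

r2_full, _ = fit_ols(X, y)
kept = backward_eliminate(X, y)
print(f"R squared with all {n_vars} noise variables: {r2_full:.2f}")
print(f"noise variables surviving elimination: {len(kept)}")
```

The exact numbers depend on the seed, so do not expect to reproduce the 7 variables from the quote. The point is that any variable that survives this procedure looks “significant” even though, by construction, none of them has anything to do with Y.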

The larger the data set, the stronger the noise

Recent research has provided proof that as data sets grow larger they have to contain arbitrary correlations. These correlations appear simply due to the size of the data, which indicates that many of the correlations will be spurious. Unfortunately, too much information tends to behave like very little information.
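A small numerical illustration of this effect, as a sketch rather than a demonstration of the research result itself: I generate columns of independent random noise and report the strongest pairwise correlation found among them. The function name, the sample size of 50 and the choice of normally distributed noise are my own assumptions. Note that this only varies the number of variables, which is one narrow way in which a data set can “grow larger”.

```python
import numpy as np


def strongest_noise_correlation(n_vars, n_samples=50, seed=0):
    """Largest absolute pairwise correlation among independent noise columns."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_samples, n_vars))
    corr = np.corrcoef(data, rowvar=False)    # n_vars x n_vars correlation matrix
    np.fill_diagonal(corr, 0.0)               # ignore each variable's correlation with itself
    return float(np.abs(corr).max())


for n_vars in (10, 100, 1000):
    strongest = strongest_noise_correlation(n_vars)
    print(f"{n_vars:>5} noise variables -> strongest correlation {strongest:.2f}")
```

With more columns there are many more pairs to compare (1000 variables give almost half a million pairs), so the most impressive-looking correlation tends to get stronger even though every column is pure noise.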

This is a major concern in applications where you work with high-dimensional data. As an example, let’s say you gather data from thousands of sensors in an industrial plant, and then mine these data for patterns to optimize performance. In such cases, you could easily be fooled into acting upon phantom correlations rather than real indicators of operational performance. This could potentially be very bad news, both financially and in terms of safe operation of the plant.

Adding data vs. adding information

As data scientists, we often claim that the best way to improve our AI model is to “add more data”. However, the idea that just “adding more data” will magically improve the performance of your model does not always hold. What we should focus on instead is to “add more information”. The distinction between “adding data” and “adding information” is crucial: adding more data does not equal adding more information (at least not useful and correct information). On the contrary, by blindly adding more and more data, we run the risk of adding data that contains misinformation, which can degrade the performance of our models. With abundant access to data, as well as the computing power to process it, this becomes increasingly important to consider.
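Here is a deliberately simple sketch of that distinction, again in NumPy and with made-up numbers. A small, trustworthy data set follows a known relationship (a slope of 2, chosen by me for the example). I then add ten times as many rows from a hypothetical faulty source in which Y has nothing to do with X, and re-estimate the slope on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_SLOPE = 2.0  # assumed ground truth for this toy example


def estimate_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


# 200 trustworthy rows: y really depends on x.
x_good = rng.uniform(0, 1, size=200)
y_good = TRUE_SLOPE * x_good + rng.normal(0, 0.1, size=200)

# 2000 extra rows from a hypothetical faulty source: y is unrelated to x.
x_bad = rng.uniform(0, 1, size=2000)
y_bad = rng.uniform(0, 2, size=2000)

slope_good = estimate_slope(x_good, y_good)
slope_all = estimate_slope(
    np.concatenate([x_good, x_bad]),
    np.concatenate([y_good, y_bad]),
)

print(f"slope from 200 good rows:        {slope_good:.2f}")
print(f"slope from all 2200 pooled rows: {slope_all:.2f}")
```

The pooled data set is eleven times larger, yet its slope estimate ends up much further from the true value than the estimate from the small clean set. More rows, less information.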

Outlook

So, should the above challenges stop you from adopting data-driven decision making? No, far from it. Data-driven decision making is here to stay. It will become increasingly valuable as we gain more knowledge on how to best harness all available data and information to drive performance, be it clicks on your website or optimal operation of an industrial plant.

However, it is important to be aware that it requires more than just hardware and lots of data to succeed. Big data and computing power are important ingredients, but they are not the full solution. You also need to understand the underlying mechanisms that connect the data. Data will not speak for itself; we give numbers their meaning. The Volume, Variety or Velocity of data cannot change that.