
“People who are crazy enough to think they can change the world are the ones who do.”

-Rob Siltanen-

Intro

Over the last decade, advances in model improvement have accelerated in many directions, because the demand for visible performance gains is now global. Decision makers don’t need to be statisticians to understand the value of increasing revenue or decreasing costs.

Thousands of companies around the world, from small startups to global corporations, find great value in improving the performance of their supervised or unsupervised ML models, whether it’s a sales or demand forecast, a market basket analysis recommender, a customer classifier, a sales optimizer, a chatbot, an algorithmic trading pipeline, a document labeler, an elections forecast, a spam filter, a medical diagnosis solution, a route optimizer, a face recognizer or a self-driving car.

And I’m not even going to get started on IoT.

However, all of them seem to try to increase accuracy (reduce error) by focusing mainly on two things:

1) Feature engineering (getting the most out of your features by crunching your dataset to death)

2) Model/parameter optimization (choosing the best model and best parameters even if you have to come up with a hybrid of several algorithms and iterate to infinity)

Both of the above are necessary, but there is a third process that adds value in a complementary way. It has traditionally been wildly underused in most data science projects and is only now starting to take off.

Adding external data.

Over 90% of the world’s data has been created in the last two years alone, and volumes are expected to continue growing exponentially. Every 6 hours, one quintillion bytes of data are generated globally.

You can’t come up with an intuitive reference for how much that is without resorting to stars or atoms, and even so, that figure will seem laughable in a couple of years.

On the other hand, we have broad access to cutting-edge systems, like neural networks with genetic algorithms, that are remarkable at explaining one variable with other variables in the same dataframe (once they are in a tidy, numerical format).

So the question isn’t IF the two worlds are going to meet, the question is WHEN, and the answer is starting to look like NOW.

With so many sudden changes impacting this highly uncertain and socially-distanced way of life, it is especially challenging to generate accurate predictions relying solely on internal data. Therefore, it is now more relevant (and feasible) than ever to enhance ML models with external data that can provide a more complete view of the problem at hand.

“Good data scientists are looking to find good, clean influencing data to blend with their own data to make more accurate predictions.”

‘4 Ways to Differentiate Your Analytics Product by Including External Datasets’ Gartner research report by Kevin Quinn and Emil Berthelsen, 24 July 2020.

Data scientists tend to be discouraged from adding external data to their models because they believe the benefit/effort ratio is low: it’s a lot of work to gather, process, profile and join unstructured data in completely different formats. Moreover, the decision to add data is ‘only based on a hunch’, and there could be no relationship at all.

But the thing is, it can be waaay simpler than you’d think. Here’s a technical tutorial on Ways to Blend External Data with your dataset using Python or R. Spoiler alert: one-liners.
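To give a flavour of what such a one-liner can look like, here is a minimal sketch in Python with pandas. It is not taken from the tutorial: the `sales` and `weather` dataframes, their column names and all of their values are invented for illustration, and it assumes both datasets already share a clean `date` key.

```python
import pandas as pd

# Hypothetical internal data: daily sales for one store (invented values).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03"]),
    "store": ["A", "A", "A"],
    "units_sold": [120, 95, 140],
})

# Hypothetical external data: daily average temperature (invented values).
# Note that 2020-07-03 is missing on purpose.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02"]),
    "avg_temp_c": [24.5, 19.0],
})

# The blend itself is one line. A left join keeps every internal row,
# and validate="many_to_one" raises an error if the external table
# has duplicate dates that would silently multiply the sales rows.
enriched = sales.merge(weather, on="date", how="left", validate="many_to_one")

print(enriched)
```

Dates with no external match end up with a missing `avg_temp_c`, so the enriched dataframe still needs the usual handling of missing values before it goes into a model.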

So, now that model enrichment with useful variables from open data is available to everyone, the time has come for ML-dependent enterprises to adapt or be outperformed.

Big things are coming.