
Why, as a data scientist, you should take your time

In the era of “here and now”, taking your time can be the key to doing good work



Piero Paialunga

3 years ago | 3 min read

The first thing I want to say is that, as a data scientist, I don’t like to talk about “non-technical” things.

It gives me the idea that I’m one of those online gurus who want to teach you how to live your life, eat healthy food, and meditate when you wake up (and of course you should wake up at 5 a.m.).

Well, I don’t want to be that guy, but in this particular case, I think a non-technical talk is ideal for the topic I want to discuss.

Some days ago I was watching a TED talk by a well-known Italian singer. He explained how, after a long unsuccessful (or not-so-successful) career, he managed to write a hit song and finally become famous.

I know this is a familiar tale so far, but that is not the interesting part of the story. The interesting part is that, while everyone around him suggested he keep riding the wave of his hit song, he decided to take a step back.

The reasoning behind this (apparently) crazy move is simple:

You need to take your time.

While this may sound like a philosophical thought, the principle behind it is extremely practical. If you want quality, you need to go slow.

In his TED talk, he gives some examples, mostly from the world of entertainment, but I think the principle is even more general.

We are surrounded by the idea of “here and now”: if you take this now, you will have it for free; if you get this service, things will come faster; if you follow my old music, new music will come out soon.

But the truth is that things take time. Data science is not an exception.

When we look at the great technological improvements of recent years, we realize that we have crazy computational power… basically for free. Machine Learning applications often come as “pretrained” models that are extremely easy to apply to your data.
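To give an idea of how little effort this takes, here is a minimal sketch of applying a pretrained model with torchvision (this is not the fastMRI code; the ImageNet model and the file name are just illustrative assumptions):

```python
# A minimal sketch of applying a pretrained model, assuming an
# image-classification setting with torchvision (illustrative only).
import torch
from torchvision import models
from PIL import Image

# Load a ResNet-18 pretrained on ImageNet: the weights come "for free".
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

# Preprocess a hypothetical image with the transforms the weights expect.
preprocess = weights.transforms()
image = Image.open("example.jpg")        # hypothetical file
batch = preprocess(image).unsqueeze(0)   # shape: (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)
predicted_class = logits.argmax(dim=1).item()
print(weights.meta["categories"][predicted_class])
```

A few lines, no training, and you already have predictions on your data.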

This scenario is so common that big companies know it, and they consider large pretrained models to be the “baseline”. For example, in the case of fastMRI, the well-known Facebook AI challenge that aims to drastically reduce MRI scan acquisition time, a 214-million-parameter network is considered to be the baseline.

For this reason, they give you the opportunity to use the pretrained model directly on the dataset, and they ask you to improve its results with your own (most likely Deep Learning) algorithm.

This huge amount of power pushes us to look for the best model possible. Thus, a great number of deep neural networks have been developed over the years and released by big companies (AlexNet, GoogLeNet, UNet…).

If we want to make it in data science, we may think that what we need to do is develop a deep neural network with an Avogadro’s number of parameters.

However, if we are not advanced researchers with a clean dataset that has been carefully filtered… that is probably the wrong way to climb the mountain.

As I said at the beginning of the post, the most important thing is to take your time.

As a physicist, I have been trained to do exactly that (maybe even trained a bit too much, sometimes…): the key part of a data analysis process is to look at the data. We may start with really basic questions like:

“Do we have any missing values? Do we have any outliers? If we are performing a classification task, are our classes equally sized?”

Or end up with more sophisticated questions like:

“Are our data normalized? If we are dealing with textual data, is PCA effective? If we are dealing with images, what does the Fourier space tell us?”

If we don’t have the answers to all these questions, it is pointless to use sophisticated Deep Learning networks!
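As an illustration, here is a minimal sketch of those first checks with pandas and scikit-learn, assuming a tabular dataset with a “label” column (the file and column names are purely hypothetical):

```python
# A hedged sketch of the basic checks above, assuming a pandas DataFrame
# with numeric feature columns and a "label" column; names are illustrative.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset

# 1. Missing values: how many per column?
print(df.isna().sum())

# 2. Outliers: a simple z-score screen on the numeric columns.
numeric = df.select_dtypes(include=np.number).drop(columns=["label"], errors="ignore")
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())    # points beyond 3 standard deviations, per column

# 3. Class balance: are the classes equally sized?
print(df["label"].value_counts(normalize=True))

# 4. Normalization: means and standard deviations of the features.
print(numeric.describe().loc[["mean", "std"]])

# 5. Is PCA effective? Look at the explained variance of the first components.
pca = PCA(n_components=min(5, numeric.shape[1]))
pca.fit(numeric.fillna(numeric.mean()))
print(pca.explained_variance_ratio_)
```

None of this is fancy, and that is exactly the point: it takes minutes and it tells you what your data actually look like.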

Plus, as you may have noted, these considerations are extremely general. In your own dataset, you may have different domain-specific things to look at and investigate.

Moreover, these checks are not only meant to control your data like a puppet master.

Again, the outcome of this process may be extremely practical: you may realize that your data are linearly separable in some abstract feature space, or that a simple linear model applied after a few pre-processing steps already gives you really good classification accuracy.
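For example, here is a minimal scikit-learn sketch of that scenario, assuming the same hypothetical tabular dataset as above, with a scaler playing the role of the pre-processing step and logistic regression as the simple linear model:

```python
# A hedged sketch of "simple linear model after preprocessing";
# the dataset and column names are invented for illustration.
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("my_dataset.csv")   # hypothetical dataset
X = df.select_dtypes(include=np.number).drop(columns=["label"], errors="ignore")
X = X.fillna(X.mean())
y = df["label"]

# Scale the features, then fit a plain linear classifier.
simple_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(simple_model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```

If this baseline is already strong, you have learned something important about your problem before spending a single GPU-hour.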

To summarize, Deep Learning is meant to reduce the domain knowledge we need in order to face Data Science problems. On the other hand, the total absence of checks on your data may lead to a wrong use of computational resources… and that’s a silly move.

So, before you go (like the Lewis Capaldi song :) ), please remember:

Take your time… and explore your data.
