
Top 3 programming mistakes every data scientist makes

How to use Pandas, Sklearn and functions



Alexandros Zenonos

3 years ago | 3 min read

Photo by Emile Perron on Unsplash

This article identifies the most common mistakes prospective data scientists make and discusses how to avoid them. Without further ado, let’s jump straight into it.

1. Inefficient use of Pandas

Pandas is a library frequently used by data scientists to handle structured data. The most common mistake is iterating through the rows in the dataframe using a “for loop”.

Pandas has built-in functions, e.g. apply and applymap, that enable you to apply a function to a selection of columns or to all of them, for every row of the dataframe or only when certain conditions are met.

Optimally, however, you could work directly with NumPy arrays, which offer the best performance.

Inefficient:

my_list = []
for i in range(0, len(df)):
    l = myfunction(df.iloc[i]['A'], df.iloc[i]['B'])
    my_list.append(l)
df['newColumn'] = my_list

Time: 640 ms

Efficient:

df['newColumn'] = myfunction(df['A'].values, df['B'].values)

Time: 400 µs

or, given a dataframe as below:

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

you can calculate the square root of every element like this:

>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
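
To illustrate the earlier point about applying a function to a selection of columns, or only when a condition holds, here is a minimal sketch; the column names and the threshold are made up for the example:

import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# apply a function to a selection of columns only
df[['A']] = df[['A']].apply(np.sqrt)

# vectorised conditional logic on the underlying NumPy arrays
df['B_is_large'] = np.where(df['B'].values > 5, 'yes', 'no')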

2. Inefficient use of Sklearn

Sklearn (scikit-learn) is a Python library used by data scientists to train machine learning models.

Data usually needs to be transformed before training a model, but how do you build a pipeline that first transforms the data and then trains? You would probably like to automate that, especially when you want to apply the transformations at every fold of your cross-validation.

Sadly, the most frequently used function within the library is train_test_split. Many data scientists just transform the data, use train_test_split, train on one set and test on the other.
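
For illustration, that pattern often looks roughly like the sketch below (the scaler and model are stand-ins chosen for the example):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)

# transform the full dataset once, then evaluate on a single hold-out split
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print(model.score(X_test, y_test))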

That doesn’t sound too bad, but it can be, particularly when you don’t have a big dataset. Cross-validation is important for assessing how well the model generalises to unseen data. Thus, some data scientists might be keen to implement their own function that applies the transformation before every validation iteration.

Sklearn has a Pipeline object that you can use to both transform the data and train a machine learning model in your cross-validation process.

In the following example, PCA is applied prior to logistic regression for every fold of the cross-validation and, at the same time, a grid search is performed to find the best parameters of the model as well as those of the PCA.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Source: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
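
The same Pipeline idea also works without a grid search: wrap the transformation and the model in a single estimator and pass it to cross_val_score, and the transformer is re-fitted on the training portion of every fold. A minimal sketch, assuming a StandardScaler as the transformation step rather than PCA:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)

# the scaler is fitted on the training folds only, so nothing leaks
# from the validation fold into the preprocessing step
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('logistic', LogisticRegression(max_iter=10000))])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())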

3. Not using functions

Some data scientists do not care as much about their code presentation and format, but they should.

Writing functions, compared to just scripting everything in a single file or notebook, has several benefits. It is not only easier to debug; the code also becomes reusable and easier to understand.

This post shows in more detail the use of functions in data science, even though I disagree with the author that you could not replicate that approach in a Jupyter notebook. In addition, as a data scientist you should aim to keep your code DRY.

That is, Don’t Repeat Yourself! Eliminating repetition makes the code reusable by you as well as by other people who might come across it. It also helps with maintenance and with identifying bugs.

For instance, you could have lines of dataframe operations like this:

df.drop(columns=['A','B'], inplace=True)
df['datetime'] = pd.to_datetime(df['dt'])
df.dropna(inplace=True)

but this isn’t efficient, because every time you want to preprocess a dataframe you would have to copy and paste all of those lines and edit the column names as needed.

It is much better, instead, to have a single function or class that handles all of these operations:

processor = Preprocess(columns_to_drop, datetime_column, dropna_columns)

You can find the implementation of Preprocess here.
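
The linked implementation is not reproduced in this post, but here is a minimal sketch of what such a Preprocess class could look like; the method name and internals are assumptions for illustration and may differ from the actual implementation:

import pandas as pd

class Preprocess:
    """Bundle the repetitive dataframe operations in one reusable place."""

    def __init__(self, columns_to_drop, datetime_column, dropna_columns):
        self.columns_to_drop = columns_to_drop
        self.datetime_column = datetime_column
        self.dropna_columns = dropna_columns

    def transform(self, df):
        # drop unneeded columns, parse the datetime column, drop missing rows
        df = df.drop(columns=self.columns_to_drop)
        df['datetime'] = pd.to_datetime(df[self.datetime_column])
        return df.dropna(subset=self.dropna_columns)

# example usage with a tiny made-up dataframe
raw_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4],
                       'dt': ['2020-01-01', None], 'value': [10, 20]})
processor = Preprocess(columns_to_drop=['A', 'B'],
                       datetime_column='dt',
                       dropna_columns=['datetime'])
clean_df = processor.transform(raw_df)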
