Top 3 programming mistakes every data scientist makes
How to use Pandas, Sklearn and functions
Alexandros Zenonos
Photo by Emile Perron on Unsplash
This article identifies the most common mistakes prospective data scientists make and discusses how to avoid them. Without further ado, let’s jump straight into it.
1. Inefficient use of Pandas
Pandas is a library frequently used by data scientists to handle structured data. The most common mistake is iterating through the rows in the dataframe using a “for loop”.
Pandas has built-in functions, e.g. apply or applymap, that let you apply a function to a selection of columns or to all of them, for every row of the dataframe or only when a condition is met.
Optimally, however, you could work with NumPy arrays directly, which gives the best performance.
Inefficient:
my_list = []
for i in range(0, len(df)):
    l = myfunction(df.iloc[i]['A'], df.iloc[i]['B'])
    my_list.append(l)
df['newColumn'] = my_list

Time: 640 ms
Efficient:
df['newColumn'] = myfunction(df['A'].values, df['B'].values)

Time: 400 µs
or given a dataframe as below:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
to calculate the square root of every element:
>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0
source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
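The applymap function mentioned earlier works element-wise on the whole dataframe instead; for example, squaring every element of the same dataframe:
df.applymap(lambda x: x ** 2)
    A   B
0  16  81
1  16  81
2  16  81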
2. Inefficient use of Sklearn
Sklearn (scikit-learn) is a Python library used by data scientists to train machine learning models.
Usually the data needs to be transformed before training a model, but how do you chain the transformation and the training into a single pipeline? You would probably like to automate that, especially when you want to apply the transformations at every fold of your cross-validation.
Sadly, the most frequently used function within the library is train_test_split. Many data scientists just transform the data, use train_test_split, train on one set and test on the other.
That doesn’t sound that bad, but it can be, in particular when you don’t have a big dataset. Cross-validation is important to assess how well the model generalises to unseen data. Thus, some data scientists might be keen on implementing their own function to apply the transformation before every validation iteration.
Sklearn has a Pipeline object that you can use to both transform the data and train a machine learning model in your cross-validation process.
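Before the fuller example below, here is a minimal sketch of the idea, using StandardScaler and LogisticRegression purely as stand-ins for whatever transformation and model you actually use. The Pipeline is passed to cross_val_score, so the scaler is re-fitted on the training folds only at every iteration:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# the scaler is fitted on the training folds only at every iteration,
# so no information from the held-out fold leaks into the transformation
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('model', LogisticRegression(max_iter=10000))])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())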
In the following example, PCA is applied prior to logistic regression for every fold of the cross-validation, and at the same time a grid search is performed to find the best parameters for both the model and the PCA.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
Source: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
3. Not using functions
Some data scientists do not care much about the presentation and format of their code, but they should.
Writing functions, compared to scripting everything in a single file or notebook, has several benefits: the code is not just easier to debug, it also becomes reusable and easier to understand.
This post shows the use of functions in data science in more detail, even though I disagree with the author that you could not replicate that approach in a Jupyter notebook. In addition, as a data scientist you should aim to keep your code DRY.
That is, Don’t Repeat Yourself! Eliminating repetition makes the code reusable by you as well as by other people who might come across it. It also helps with maintenance and with identifying bugs.
For instance, you could have lines of dataframe operations like this:
df.drop(columns=['A', 'B'], inplace=True)
df['datetime'] = pd.to_datetime(df['dt'])
df.dropna(inplace=True)
but this wouldn’t be efficient, as every time you want to preprocess a dataframe you would have to copy-paste all these lines and edit the column names as needed.
Instead, it is much better to have a function, or a class, that handles all of these operations:
processor = Preprocess(columns_to_drop, datetime_column, dropna_columns)
You can find the implementation of Preprocess here.
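The actual implementation is behind the link above; purely to illustrate the idea, a Preprocess class along these lines might look something like the sketch below (the interface and method name are my assumptions, not the author’s code):
import pandas as pd

class Preprocess:
    # illustrative sketch only, not the author's implementation
    def __init__(self, columns_to_drop, datetime_column, dropna_columns):
        self.columns_to_drop = columns_to_drop
        self.datetime_column = datetime_column
        self.dropna_columns = dropna_columns

    def transform(self, df):
        # drop unwanted columns, parse the datetime column,
        # and drop rows with missing values in the given columns
        df = df.drop(columns=self.columns_to_drop)
        df['datetime'] = pd.to_datetime(df[self.datetime_column])
        df = df.dropna(subset=self.dropna_columns)
        return df

Every dataframe can then be cleaned with a single call, e.g. clean_df = processor.transform(raw_df).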