As data scientists, one of the reasons we are employed is our machine learning skills. On paper, learning about artificial intelligence and machine learning sounds exciting. Still, as we go deeper into the subject, we realize that machine learning is not as easy as it looks.

You might build a supervised machine learning model with a single line of code, as many practitioners in the industry do. Experts have packaged the complex math and statistics behind each model into one-liners that help us in our everyday work. However, understanding what the model does behind the code is another story.

Why, though, do we need to understand machine learning concepts if the code works fine? There are many reasons I could state, but the most important is being able to select the right model for the problem you are trying to solve. Without understanding machine learning, you would hardly find the optimal solution.

That is why, in this article, I want to show you my top 4 Python packages for learning machine learning. Let’s get into it.

1. Scikit-Learn

The king of machine learning modeling in Python. There is no way I would omit Scikit-Learn from my list of learning references. If, for some reason, you have never heard of Scikit-Learn, it is an open-source Python library for machine learning built on top of SciPy.

Scikit-Learn contains all the common Machine Learning models we use in our everyday data science work. According to the homepage, Scikit-learn supports supervised and unsupervised learning modeling. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

There are many APIs available within Scikit-Learn for your machine learning purposes. We could group them into 6 sections:

Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
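
As a rough illustration of how several of these groups fit together, the sketch below chains preprocessing, classification, and model selection on the iris dataset bundled with scikit-learn. The dataset, the choice of LogisticRegression, and the 5-fold setting are my own assumptions for illustration, not something the user guide prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset (an illustrative choice)
X, y = load_iris(return_X_y=True)

# Preprocessing + classification chained into one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: estimate accuracy with 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"Mean CV accuracy: {mean_accuracy:.3f}")
```

Swapping LogisticRegression for another classifier changes only one line, which is a large part of why the library is so convenient for learning.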

To get a better understanding of the machine learning concepts and how the APIs work, Scikit-Learn provides a comprehensive user guide you can follow. The guide is easy enough for beginners to follow even with little statistical knowledge (you still need to learn some statistics).

If you are using Python from the Anaconda distribution, the Scikit-Learn package is already built into the environment. If you choose to install the package independently, you also need its dependencies; installing via pip with the line below takes care of them.

pip install -U scikit-learn

Let’s try to learn the simplest model: the linear model. As I mentioned above, Scikit-Learn contains a comprehensive user guide for people to follow. If you have never developed any machine learning model, let’s review the Scikit-Learn user guide for linear models.

#Develop the linear regression model
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
#Inspect the fitted coefficients
reg.coef_

With a single line of code, you have now developed a linear regression model. You could check the Linear Model user guide for further exploration, as it is a complete study guide. If you are interested in another machine learning model, the user guide has more learning material for it as well. The scikit-learn homepage also gives an overview you can use as a reference.
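
To peek at what fit does for this toy data, here is a minimal sketch that solves the same least-squares problem with NumPy and compares the result with scikit-learn. Using np.linalg.lstsq is my own illustrative choice; I am not claiming this is exactly how scikit-learn computes it internally.

```python
import numpy as np
from sklearn import linear_model

X = np.array([[0, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0, 1, 2], dtype=float)

# Add a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution; the two features are identical here, so lstsq
# returns the minimum-norm solution among the many that fit equally well
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

reg = linear_model.LinearRegression().fit(X, y)
print(coefs, reg.coef_)
```

Both approaches split the weight evenly between the two identical features, which is a small hint at why correlated features make individual coefficients hard to interpret.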

2. Statsmodels

Statsmodels is a Python package for statistical modeling that provides many classes for building statistical models. It was originally part of the SciPy module, but it is now developed separately.

The Statsmodels package focuses on statistical estimation based on the data. In other words, it generalizes from the data by creating a statistical model, or what we call a machine learning model.

Statsmodels provides the APIs that are frequently used in statistical modeling. The package splits them into 3 main modules:

statsmodels.api, which provides many cross-sectional models and methods, including regression and GLM.
statsmodels.tsa.api, which provides time-series models and methods.
statsmodels.formula.api, which provides an interface for specifying models using formula strings and DataFrames; in simpler terms, you can write out your own model specification.
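
To make the formula interface a little more concrete, here is a minimal sketch that fits a model from a formula string. The synthetic data, the column names x and y, and the true coefficients (2 and 3) are all hypothetical values I made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): y depends linearly on x plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
df["y"] = 2.0 + 3.0 * df["x"] + rng.normal(scale=0.5, size=200)

# The formula string names the target and the predictors;
# the intercept is included by default
formula_result = smf.ols("y ~ x", data=df).fit()
print(formula_result.params)
```

The estimated parameters should land close to the values used to generate the data, which is a handy sanity check when you are learning what the model estimates.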

Statsmodels is a great starter package for anybody who wants to understand statistical modeling in greater depth. The user guide gives you an in-depth explanation of the concepts you need for understanding statistical estimation and the statistical explanation behind the machine learning model.

Let’s try to learn one linear regression model using the Statsmodels package. The guide explains the model, as shown in the image below.

Linear Regression Technical Documentation (Source: https://www.statsmodels.org/stable/regression.html)

As you can see, the documentation is packed with information and is definitely worthwhile learning material.

Let’s try to learn OLS (Ordinary Least Squares) modeling using the Statsmodels package. If you are not using Python from the Anaconda distribution or have not installed the Statsmodels package, you can use the following line to install it.

pip install statsmodels

Continuing the steps, let’s develop the model by importing the package and the dataset.

#Importing the necessary package
import pandas as pd
#Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
import statsmodels.api as sm
from statsmodels.api import OLS
#import the data
boston = load_boston()
data = pd.DataFrame(data = boston['data'], columns = boston['feature_names'])
target = pd.Series(boston['target'])
#Develop the model
sm_lm = OLS(target, sm.add_constant(data))
result = sm_lm.fit()
result.summary()

The OLS model you develop with the Statsmodels package gives all the results you would expect from a model estimation, such as coefficients, standard errors, and fit statistics. For further interpretation of the result, you could visit the OLS example on the homepage.

3. Eli5

Machine learning is not complete without the explainability behind the model. From my experience working as a data scientist, most of the time you need to explain why your model works and what kind of insight it gives. By insight, I am not referring to the model accuracy or any other metric but to the machine learning model itself. This is what we call Machine Learning Explainability.

There are many advanced ML interpretation Python packages out there, but most of them are too specialized to offer much of a learning opportunity. In this case, I recommend Eli5 as your package for studying machine learning interpretability, as it covers the basic concepts without too many complications.

According to the Eli5 documentation, the basic uses of this package are to:

inspect model parameters and try to figure out how the model works globally;
inspect an individual prediction of a model and figure out why the model makes the decision.

You could learn how to interpret your machine learning model, especially a black-box model, through the two approaches above. My favorite learning material is Permutation Importance, as it is the most basic way to explain your machine learning model.

Let’s try to learn Permutation Importance by using the Eli5 package. First, we need to install the package with one of the following commands.

#installing eli5
pip install eli5
#or
conda install -c conda-forge eli5

Let’s prepare a sample dataset for practice.

#Preparing the model and the dataset
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
mpg = sns.load_dataset('mpg')
mpg.drop('name', axis = 1, inplace = True)
#Encode the string labels as integers (assumption: recent XGBoost versions expect integer class labels)
mpg['origin'] = mpg['origin'].astype('category').cat.codes
#Data splitting for xgboost
X_train, X_test, y_train, y_test = train_test_split(mpg.drop('origin', axis = 1), mpg['origin'], test_size = 0.2, random_state = 121)
#Model Training
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)

After installing the package and preparing the sample data, we use the Eli5 package for machine learning explainability with Permutation Importance.

#Importing the module
from eli5 import show_weights
from eli5.sklearn import PermutationImportance
#Permutation Importance
perm = PermutationImportance(xgb_clf, scoring = 'accuracy', random_state = 101).fit(X_test, y_test)
show_weights(perm, feature_names = list(X_test.columns))

For further interpretation of the result, you could visit the user guide which has an adequate explanation.
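
To see the idea behind Permutation Importance without any extra package, here is a minimal hand-rolled sketch: shuffle one feature at a time and measure how much the accuracy drops. This is my own simplified version on hypothetical synthetic data, not Eli5's implementation, and the helper name manual_permutation_importance is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data (hypothetical): only the first feature determines the label
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

clf = LogisticRegression().fit(X_tr, y_tr)
baseline = accuracy_score(y_te, clf.predict(X_te))

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    shuffle_rng = np.random.default_rng(seed)
    base = accuracy_score(y, model.predict(X))
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this feature and the target
            X_shuffled[:, col] = shuffle_rng.permutation(X_shuffled[:, col])
            drops.append(base - accuracy_score(y, model.predict(X_shuffled)))
        importances.append(np.mean(drops))
    return np.array(importances)

importances = manual_permutation_importance(clf, X_te, y_te)
print(baseline, importances)
```

Shuffling the informative feature should hurt accuracy badly, while shuffling the noise feature should barely matter; that gap is what the importance scores express.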

4. MLflow

The current state of machine learning education is not limited to the machine learning model; it has expanded into automating the processes around the model. This is what we call MLOps, or Machine Learning Operations.

Many open-source Python packages support the MLOps lifecycle, but in my opinion, MLflow has a complete set of MLOps learning material for any beginner.

According to the MLflow homepage, MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. This package handles 4 functions:

Experiment tracking (MLflow Tracking),
Reproducible ML code (MLflow Projects),
Managing and deploying models (MLflow Models),
Central model lifecycle management (MLflow Model Registry).

I love this package because it explains the MLOps concepts in an organized way that any beginner can follow. You could check the concepts page for more detail.

I will not give an example of MLflow here because I want to dedicate a separate article to this package, but I reckon you should learn from it to understand machine learning and MLOps concepts in more detail.

Conclusion

As data scientists, one of the reasons we are employed is our machine learning skills. Many learn the code without knowing the concepts behind machine learning and what we could do with the model.

To help with your study, I introduced my top 4 Python packages for learning machine learning. They are:

Scikit-Learn
Statsmodels
Eli5
MLflow

I hope it helps!