How To Deconstruct a Machine Learning Algorithm in Python
Unboxing the black box with Game Theory
You don’t have to be an experienced developer to use Machine Learning (ML) algorithms in your code. In fact, it’s almost comical how low the entry point is to implement your very own ML algorithm. If you’re like me, you might get a feeling of incompleteness after finishing a small ML project, like: “Okay, what now?”, “What do these numbers mean?”, “How did each feature in my dataset influence the results?”, “How do I visualize this?”.
As it turns out, there’s a handy Python package called SHAP that can answer many of these questions for you.
From its documentation, SHAP (SHapley Additive exPlanations) “is a game-theoretic approach to explain the output of any machine learning model.” It is based on the Shapley value, named after Lloyd Shapley, who introduced the concept in 1951 and won the Nobel Memorial Prize in Economic Sciences in 2012.
Install SHAP with:
pip install shap
The Power of Visualization
The most compelling part about this library is its ability to visualize how each data point of each feature in your data set contributed to the model’s prediction. I tried SHAP’s visualizer out on a common housing price data set used in machine learning applications to see how much additional meaningful information it could give me.
For this experiment, I trained a model using XGBoost (eXtreme Gradient Boosting), which provides a Scikit-Learn-compatible API. The summary plot is generated with ‘shap.summary_plot’.
The x-axis shows each data point’s SHAP value, i.e. the magnitude and direction of its impact on the model’s output (the predicted median house price). The y-axis ranks the features by importance, measured as the mean absolute SHAP value, and the color encodes whether the feature’s value was high or low.
Interpreting the Graph
From the visualizer, we can gather that when the LSTAT (% lower status of the population) was high, it had a very large negative impact on the model’s prediction of the house price.
Conversely, a low LSTAT had a very large, yet more concentrated, positive impact on the model’s prediction of the house price. A high NOX value (nitric oxides concentration, in parts per 10 million) had both slightly positive and slightly negative effects on individual predictions, but a net negative impact overall. A low NOX value, on the other hand, only served to increase the model’s prediction of the house price.
The SHAP visual explainer allows you to easily draw conclusions about every feature. It’s quite interesting to go through each one individually and hypothesize why each feature might have such an impact on the model’s prediction.
SHAP also allows you to drill down further into individual housing predictions to see how each feature affected the model’s prediction of price; as in, choosing a specific house in the data set to see how each factor (feature) contributed to the model’s prediction of said house’s price.
This more granular explanatory visualization can be called with the shap.force_plot function, displayed below:
Even if you have a solid understanding of how to interpret machine learning models, this library may seem like a bit of a novelty. But if you’ve ever had to explain your machine learning project to a less tech-savvy friend or co-worker, you might rethink its value.
A picture is worth a thousand words and this just might be the closest thing we can get to a picture of a machine learning algorithm.