ESPN’s The Last Dance, the long-anticipated documentary that details the ins and outs of Michael Jordan’s sixth and final NBA championship with the Chicago Bulls in 1998, left me intrigued and wanting to know more.

Being barely a year old when Jordan completed his last season with the Bulls, I never truly experienced the magic of His Airness: how he managed to completely dominate an era and earn legendary status as the Greatest of All Time.

Whether it’s leading the Bulls to a 72–10 record during the 1995–96 season, winning a total of 5 MVP awards or creating a shoe brand that has transformed popular culture over the years, there is widespread belief that no one will ever come close to reaching his level of greatness.

But how would Jordan compare with the players in the NBA today, a league I have gotten to know so well over the years? Using analytics, I hope to find out just how much a team would value MJ based purely on the stats he delivers night in, night out.

Problem Introduction

For the purposes of this analysis, I will use a player’s salary as a metric to measure how valuable a player is.

I am aware that so many players in the league are overpaid (looking at you, Andrew Wiggins) or underpaid, but I wanted to know just how much Jordan would be paid in today’s league and be able to compare his projected salary to the superstars of my time.

I will apply a random forest algorithm to a dataset containing all NBA players from the 2019–20 season, validate the model, then apply it to a new dataset containing all players from the 1997–98 season, the season when Jordan won his final championship and rightfully earned his godlike status.

Michael Jordan and teammate Scottie Pippen. Photo by Getty Andy Hayt / NBAE

Diving into the Data

For this analysis, I retrieved NBA player data for both the 2019–20 and 1997–98 regular seasons from Basketball Reference and self-selected basic and advanced statistics that are commonly used as performance indicators. Here is a glossary of some of the advanced terms to get you up to speed:

I also added columns for ‘Rookie Scale’ and ‘Max’, which are dummy variables set to 1 for players on rookie deals (players in their first two years in the league) and max deals (superstars who can earn up to 35% of the team’s salary cap) during that season, and 0 for all other players. The final column is player salary, our dependent variable of interest.

Overall, the datasets contain 20 columns with 412 entries for the 2019–20 season and 19 columns with 439 entries for the 1997–98 season; the latter leaves out player salary (which we will be predicting!). Here is a preview of both:
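As a minimal sketch of how such dummy columns could be built with pandas: the 'Years in League' and 'Contract Type' columns and the three example rows below are hypothetical placeholders, not the actual Basketball Reference fields or real player data.

```python
import pandas as pd

# Hypothetical example rows; names and values are illustrative only
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Years in League": [1, 7, 2],                 # assumed helper column
    "Contract Type": ["rookie", "max", "rookie"], # assumed helper column
})

# 1 for players in their first two years in the league, 0 otherwise
players["Rookie Scale"] = (players["Years in League"] <= 2).astype(int)

# 1 for players on a max deal, 0 otherwise
players["Max"] = (players["Contract Type"] == "max").astype(int)

print(players[["Player", "Rookie Scale", "Max"]])
```

Storing the flags as 0/1 integers keeps every feature numerical, which is the form the model code later in this article expects.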

2019–20 NBA Season Player Data

1997–98 NBA Season Player Data

Describing the Dataset

Now that I have the data cleaned up and ready to go, I will start the process of creating a model to predict salary using the 2019–20 player data. I first wanted to see the distribution of player salaries league-wide, then explore how salary is related to some of the performance indicators.

The data is heavily skewed to the right, with around half of the players earning less than $5 million. As expected, all the rookies sit at the lower end of the pay scale (< $10 million), while all max players earn $25 million or more.

The plots above show that the relationship between salary and each of the other variables hardly follows a linear trend. This suggests that linear regression might not be the best tool for making our predictions.

However, there are so many other predictive analysis tools we can use, including random forests (which I will use in the next part of this analysis!).

Setting up the Dataset for the Random Forest Algorithm

The random forest algorithm makes predictions by averaging the results of many decision trees, with each tree in the forest considering a random subset of features and only having access to a random set of data points (hence the name!). Check out this page for a wonderful explanation of this method.
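To make the idea concrete, here is a rough hand-rolled sketch of that averaging, built from scikit-learn's DecisionTreeRegressor on synthetic data. The helper names fit_tiny_forest and predict_tiny_forest are made up for illustration, and this is not how the library implements its forest internally. Note also that scikit-learn's RandomForestRegressor considers all features at each split by default, so the feature subsampling shown here (max_features="sqrt") is an assumption made to match the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not real NBA statistics)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 5 * np.sin(X_demo[:, 1]) + rng.normal(0, 1, size=200)

def fit_tiny_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    sampler = np.random.default_rng(seed)
    trees = []
    for i in range(n_trees):
        # Random set of data points: sample rows with replacement
        rows = sampler.integers(0, len(X), size=len(X))
        # Random subset of features considered at each split
        tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
        trees.append(tree.fit(X[rows], y[rows]))
    return trees

def predict_tiny_forest(trees, X):
    """Average the predictions of all trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

trees = fit_tiny_forest(X_demo, y_demo)
demo_pred = predict_tiny_forest(trees, X_demo[:5])
print(demo_pred)
```

In practice the packaged RandomForestRegressor used below does all of this in one call; the sketch only shows where the "random" and the "forest" come from.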

In order to use the random forest package in Python, I have to separate the data into its X and y variables, a format that the package will accept. Here is how I did it:

# Import NumPy, which is used to build the arrays below
import numpy as np

# Omit the player names (column 0) and include only numerical variables
data = data.iloc[:, 1:20]

# Set up a variable to store the Salary values and remove it from the initial data
y = np.array(data['Salary'])
data = data.drop(['Salary'], axis=1)
X = np.array(data)

# Save the column names of the X variables
columns = list(data.columns)

The dataset is split up and ready! The random forest model itself is really straightforward to use in Python: after importing it from scikit-learn, I specify 5,000 trees and set a random seed so that I can reproduce my results in the future if I need to.

# Import the random forest algorithm and specify parameters
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 5000, random_state = 42)

Using K-Fold Cross Validation for Training

I now have to train the random forest model on our dataset, test it on unseen samples and check that the model is robust enough to produce meaningful results.

Rather than splitting the data once into training and testing sets, I will use k-fold cross validation, which usually gives a less biased estimate of how well the model performs. This approach ensures that every observation from the original dataset appears in both a training set and a testing set. Check out this cool page for a detailed explanation!

Again, I can conveniently import the cross-validation tools from scikit-learn. This time, I will use k=10 folds, which is common practice. I will then fit the model on each fold and measure its score on the held-out portion.
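A small sketch of the splitting mechanics, using ten toy observations and five folds (both numbers are chosen only to keep the output readable): each observation lands in the test set of exactly one fold and in the training set of all the others.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # 10 toy observations
test_counts = np.zeros(len(toy_X), dtype=int)

for train_idx, test_idx in KFold(n_splits=5).split(toy_X):
    # Within a fold, the training and testing indices never overlap
    assert set(train_idx).isdisjoint(test_idx)
    test_counts[test_idx] += 1
    print("train:", train_idx, "test:", test_idx)

# Number of times each observation was held out for testing
print(test_counts)
```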

# Import the k-fold splitter and the cross-validation scoring helper
from sklearn.model_selection import KFold, cross_val_score

# Specify the number of folds (k = 10)
k_fold = KFold(n_splits=10)

# Fit the random forest on each training fold and score it (R^2) on the held-out fold
score = [rf.fit(X[train], y[train]).score(X[test], y[test]) for train, test in k_fold.split(X)]

# Round the scores and print
[round(num, 3) for num in score]

>> [0.604, 0.761, 0.794, 0.672, 0.776, 0.616, 0.819, 0.653, 0.5, 0.848]

As the model iterates over each fold, it produces a different score depending on the segment of the dataset it is tested on. Note that for a regressor, the score method returns the coefficient of determination (R²) rather than a classification accuracy. I believe that the model produces a range that is acceptable (from a low of 0.50 up to 0.848) and is pretty robust given our circumstances! On average, this model explains about 70.4% of the variance in player salaries.

From here, the random forest also allows me to see which variables contribute the most to predicting salary. I plotted these importances below:

Intuitively, the ‘Max’ dummy variable provides a strong indication that a player will be paid significantly more than others. In addition, ‘PTS’ (points per game), ‘WS’ (win shares) and the ‘Rookie Scale’ dummy variable also help determine salary, albeit as less important indicators.
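For reference, the same per-fold scoring can be written more compactly with cross_val_score. The sketch below runs on synthetic data rather than the player dataset, and it shuffles the rows before splitting, which is my own assumption and differs from the unshuffled KFold used above; shuffling can matter when rows are sorted, for example by salary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the player data (not real NBA statistics)
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 30, size=(200, 3))
y_toy = 1_000_000 * X_toy[:, 0] + rng.normal(0, 2_000_000, size=200)

# A smaller forest than in the article, only to keep the sketch fast
model = RandomForestRegressor(n_estimators=50, random_state=42)
folds = KFold(n_splits=10, shuffle=True, random_state=42)

# One R^2 value per fold, then their average
r2_scores = cross_val_score(model, X_toy, y_toy, cv=folds, scoring="r2")
mean_r2 = r2_scores.mean()
print(np.round(r2_scores, 3), round(mean_r2, 3))
```

cross_val_score fits a fresh clone of the model for every fold, so the model object passed in is left unfitted afterwards.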
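A minimal sketch of how such importances can be read off a fitted forest through its feature_importances_ attribute. The data and the column names stat_a, stat_b and stat_c are synthetic placeholders, with the target deliberately driven mostly by the first column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features with placeholder names (not the article's dataset)
rng = np.random.default_rng(0)
feature_names = ["stat_a", "stat_b", "stat_c"]
X_syn = rng.uniform(0, 1, size=(300, 3))
# Target depends strongly on stat_a, weakly on stat_b, not at all on stat_c
y_syn = 10 * X_syn[:, 0] + 1 * X_syn[:, 1] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_syn, y_syn)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based importances sum to 1, so each value can be read as a share of the forest's total split quality attributed to that feature.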

Testing the Model on a New Dataset

That’s it — we have successfully built and trained our own salary prediction model! Now, I will use this model on our 1997–98 NBA player dataset to generate predicted salaries for all of our players based on their season statistics.

These predicted salaries will reflect how much each player in the 1997–98 season would be paid if they were all playing in 2019–20, based on this season’s salary cap! Let’s set up the model for prediction:

# Only include the necessary columns for prediction (drop the player names in column 0)
X97 = np.array(data97.iloc[:, 1:20])

# Use our random forest to predict salary based on 1997 player stats
# (rf here is the model as last fitted in the cross-validation loop)
prediction = rf.predict(X97)

# Place our predicted salary array as a new column in our dataset
data97['Predicted Salary'] = prediction.astype(int)
data97.head()

There you have it — a column with predicted salaries for all 439 players in the 1997–98 NBA season! Finally, we have reached the part we’ve all been waiting for. Let’s find MJ’s predicted salary:

data97.loc[data97['Player'] == 'Michael Jordan*']

Based purely on his stats, Michael Jordan would be paid $31,562,081 in today’s game. Let’s see how his salary matches up with today’s highest paid superstars with max contracts:

This chart shows that MJ would hypothetically be only the 15th highest paid player in the NBA, behind Kyrie Irving.

That seems a little undervalued for the greatest player to have set foot on a basketball court. Looking back at our model, let’s see how Jordan compares with the top 20 highest paid players in the league in terms of some of our most significant stats:

The plots show that MJ contributes more wins (WS) to his team and produces more value over a replacement-level player (VORP) than any player in the NBA today.

Compared with 2019–20 player performances, MJ scored more points per game (PPG) than every player in this group except Damian Lillard and James Harden.

Despite how reliant the Bulls were on MJ when he was on the floor, his usage rate (USG%) still lags behind those of Harden and Russell Westbrook (who, incredibly, are on the same team!), which puts into context just how much the Houston Rockets are trying to give these players the ball this season.

Concluding Remarks

Because I worked so hard on obtaining the data and building my salary prediction model, I figured I would also use it to find predicted salaries for other legends of the game who played against MJ.

Shaquille O’Neal

Shaq, who was in his second year with the Los Angeles Lakers after signing in 1996, posted 28.3 points per game in the 1997–98 season (second only to Jordan), in addition to 11.4 rebounds per game.

Our model predicts his salary to be $28,028,831. After Jordan’s retirement, he would go on to win a three-peat of his own with the late Kobe Bryant from 2000 to 2002.

Gary Payton

Gary Payton, also known as “The Glove”, would end up with 1998 All-NBA Team honors with season averages of 19.2 points and 8.3 assists per game. Our model predicts his salary to be $34,765,216.

He took his Seattle SuperSonics to the Western Conference Semifinals before falling to Shaq’s LA Lakers team.

Scottie Pippen

The Last Dance made me realize that there would be no Michael Jordan without Scottie Pippen. Despite being the ever-present Robin to Jordan’s Batman, he was severely underpaid, earning just $2,775,000, which ranked 122nd in the NBA and 6th on the Chicago Bulls that season. The model expects him to earn $8,508,579.

The table below shows the would-be top 10 earners in the 1997–98 season based on our model:

Of course, this prediction model is far from perfect. A player’s salary is not always the best measuring stick of how much value he brings to the team, but this imperfection brings out the beauty of the NBA itself: talented general managers around the league rely on analytics to uncover hidden gems, players who are undervalued but make meaningful contributions to the team.

In 2017, a collective bargaining agreement was reached between the NBA and the Players Association regarding a “supermax” contract (officially known as the Designated Veteran Player Extension) that allowed teams to offer an eligible player up to 35% of its salary cap.

This was designed to help teams retain their players by allowing them to offer significantly more money than the competition. In our case, such contracts would tend to overvalue some players compared to others when we only take into account the pure statistics they are producing. Players who have received supermax contracts include Stephen Curry, James Harden, Russell Westbrook and John Wall.

This model also does not take into account a player’s performance in the playoffs. The legends of the game are defined and compared against one another by the number of championships they can win for their team and how they can carry their team in situations when they are needed the most.

If you need another reminder of just how clutch MJ is, the way he won his sixth and final championship for the Bulls with “The Last Shot” really says it all:

Michael Jordan is still the greatest player to ever play the game of basketball, and I believe he forever will be.