
My First Machine Learning App – The Code (Line-by-line explanation)

Line-by-line explanation of the code in the .ipynb file used to clean the data and create the model.



Salim Oyinlola


I started my Machine Learning career midway through 2020, and after taking a plethora of online courses, I felt ready to apply this knowledge to a real-world problem. It struck me that everything I had learnt from the lectures would be useless unless I gained some practical experience with tools actually used in the industry. In light of this, at the beginning of the month, I made a pact to build a Machine Learning app.

The dataset used for this app was generated from Stack Overflow's 2021 Developer Survey. Every year, Stack Overflow asks developers what the state of software engineering looks like for them, and tens of thousands of developers worldwide answer. This is the dataset that was fed to the model.

This model predicts the annual salary of a software developer based on the inputs provided at the app's prompts.

Here is a step-by-step explanation of my thought process and code used for the model.

The first step is to create a virtual environment. To do this, I opened the Anaconda command prompt and ran the following commands.

conda create -n [name_of_virtual_env] python=3.8
conda activate [name_of_virtual_env]

While the first line of the above code block creates a new environment, it is only after running the second line that the created environment is activated.
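Since the new environment starts out empty, the libraries used later in the notebook also have to be installed into it. Something along these lines should do it (the exact package list is my assumption, based on what the notebook imports):

conda install jupyter pandas matplotlib scikit-learn numpy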

The next step was to open my Jupyter notebook:

jupyter notebook

In the notebook, a new Python file was created and the following code was written in it.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_results_public.csv")

The first two lines import the Pandas and Matplotlib libraries respectively. The third line reads the CSV file. The dataset used in my model is survey_results_public.csv, obtained from the 2021 Stack Overflow Developer Survey.

Next, it is important to inspect the data, so I displayed its first five rows using the command below.

df.head()

I noticed that there is a lot of missing data (NaN values), so I began to clean the data.

Furthermore, I only intend to keep the few columns that will be used for my model, so I picked out just the parts of the data that are relevant to it.

df = df[["Country", "EdLevel", "YearsCodePro", "Employment", "ConvertedCompYearly"]]
df = df.rename({"ConvertedCompYearly": "Salary"}, axis=1)
df.head()

The five chosen columns are the country where the developer works, the developer's education level, the number of years the developer has worked professionally, the type of employment (full-time, part-time, etc.), and the developer's annual salary in dollars.

Also, I renamed the ConvertedCompYearly column to Salary for easier identification.

df.info()

This line of code is used to see the details of the data set as shown below.

Next up;

df = df.dropna()                                    # drop rows with any missing values
df.isnull().sum()                                   # confirm no missing values remain
df = df[df["Employment"] == "Employed full-time"]   # keep only full-time developers
df = df.drop("Employment", axis=1)                  # the column is now constant, so drop it
df.info()

This block of code drops the rows in the dataset with missing values, keeps only the respondents employed full-time, and then drops the now-constant Employment column. The .dropna() method was used rather than filling the missing datapoints with, say, the mean value, because dropping rows is not detrimental to the model here: the dataset still contains enough datapoints afterwards.
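For contrast, the mean-imputation alternative mentioned above would have looked something like the sketch below; it is applied to a copy of the data and is not part of the actual pipeline.

# Sketch of the alternative: fill missing salaries with the column mean instead of dropping rows
df_filled = df.copy()
df_filled["Salary"] = df_filled["Salary"].fillna(df_filled["Salary"].mean())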

Up next;

The Country column contains country names as strings, and countries with relatively few respondents need to be grouped together as 'Other'. Using 400 responses as the cutoff below which a country is reclassified, we have;

def shorten_categories(categories, cutoff):
    categorical_map = {}
    for i in range(len(categories)):
        if categories.values[i] >= cutoff:
            categorical_map[categories.index[i]] = categories.index[i]
        else:
            categorical_map[categories.index[i]] = 'Other'
    return categorical_map

country_map = shorten_categories(df.Country.value_counts(), 400)
df['Country'] = df['Country'].map(country_map)
df.Country.value_counts()

In a bid to ensure that the dataset is not skewed, salaries less than $10,000 and greater than $250,000 are excluded, and the rows grouped under the 'Other' country label are dropped.

df = df[df["Salary"] <= 250000]
df = df[df["Salary"] >= 10000]
df = df[df['Country'] != 'Other']
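To sanity-check these cutoffs, the salary distribution per country can be plotted using the Matplotlib import from the start of the notebook. The snippet below is a sketch of how I would do that here, not code taken from the original notebook.

# Boxplot of salary per remaining country, to eyeball outliers and skew
fig, ax = plt.subplots(1, 1, figsize=(12, 7))
df.boxplot('Salary', 'Country', ax=ax)
plt.suptitle('Salary (US$) vs Country')
plt.title('')
plt.ylabel('Salary')
plt.xticks(rotation=90)
plt.show()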

Cleaning the YearsCodePro (experience) column: the survey entry 'More than 50 years' is mapped to 50, 'Less than 1 year' is mapped to 0.5, and everything else is cast to a float.

df["YearsCodePro"].unique()
def clean_experience(x):
if x == 'More than 50 years':
return 50
if x == 'Less than 1 year':
return 0.5
return float(x)

df['YearsCodePro'] = df['YearsCodePro'].apply(clean_experience)

Cleaning the EdLevel (education) column by grouping its unique entries into four broader categories.

df["EdLevel"].unique()
def clean_education(x):
if 'Bachelor’s degree' in x:
return 'Bachelor’s degree'
if 'Master’s degree' in x:
return 'Master’s degree'
if 'Professional degree' in x or 'Other doctoral' in x:
return 'Post grad'
return 'Less than a Bachelors'

df['EdLevel'] = df['EdLevel'].apply(clean_education)
df["EdLevel"].unique()

The unique education categories are then: Bachelor’s degree, Master’s degree, Post grad, and Less than a Bachelors.

The training data is labelled in words, which is human-readable but not something the algorithm can work with directly. Label Encoding converts these string labels into numeric form so that they become machine-readable, letting the learning algorithm operate on the categories. It is an important pre-processing step for structured datasets in supervised learning.

from sklearn.preprocessing import LabelEncoder
le_education = LabelEncoder()
df['EdLevel'] = le_education.fit_transform(df['EdLevel'])
df["EdLevel"].unique()

le_country = LabelEncoder()
df['Country'] = le_country.fit_transform(df['Country'])
df["Country"].unique()

Splitting the data into features, X, and labels, y.

X = df.drop("Salary", axis=1)
y = df["Salary"]

The next step was to pick a suitable model for the cleaned dataset and train it. I went with a decision tree regressor, measured its error, and then tuned its maximum depth with a grid search.

from sklearn.tree import DecisionTreeRegressor

# Fit a baseline decision tree on the cleaned dataset
dec_tree_reg = DecisionTreeRegressor(random_state=0)
dec_tree_reg.fit(X, y.values)
y_pred = dec_tree_reg.predict(X)

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Root mean squared error of the baseline tree, evaluated on the training data
error = np.sqrt(mean_squared_error(y, y_pred))
print("${:,.02f}".format(error))

from sklearn.model_selection import GridSearchCV

# Tune the maximum tree depth with a grid search
max_depth = [None, 2, 4, 6, 8, 10, 12]
parameters = {"max_depth": max_depth}
regressor = DecisionTreeRegressor(random_state=0)
gs = GridSearchCV(regressor, parameters, scoring='neg_mean_squared_error')
gs.fit(X, y.values)
regressor = gs.best_estimator_

# Refit the best tree and report its error
regressor.fit(X, y.values)
y_pred = regressor.predict(X)
error = np.sqrt(mean_squared_error(y, y_pred))
print("${:,.02f}".format(error))

The next step is to save the trained model in a pickle file. This is done using the code shown below.

import pickle
data = {"model": regressor, "le_country": le_country, "le_education": le_education}
with open('saved_model.pkl', 'wb') as file:
    pickle.dump(data, file)

As seen above, a dictionary holding the model and both label encoders is first created. Thereafter, a pickle file, saved_model.pkl, is opened for writing; the 'wb' mode means write binary, which is important because pickle writes bytes rather than text.
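Loading the file back returns the same dictionary, so the model and both encoders come out together; this is roughly how it can be read back later, for instance in the Flask app from the next article.

with open('saved_model.pkl', 'rb') as file:
    data = pickle.load(file)

regressor_loaded = data["model"]
le_country_loaded = data["le_country"]
le_education_loaded = data["le_education"]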

In addition, I created a requirements.txt file for those looking to install the necessary modules and libraries.

pip freeze > requirements.txt
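Anyone recreating the project can then install everything listed in that file with:

pip install -r requirements.txt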

In my next article, I will be talking about how I deployed the ML app locally using Flask.

The full code is available here in .ipynb format. Do well to like this article and keep the discussion going in the comment section.


