
Predicting Instagram Influencers Engagement with Machine Learning in Python

A step-by-step guide to predicting Instagram influencer engagement, from acquiring the Instagram data to interpreting the results.



Adipta Martulandi

3 years ago | 7 min read

In the last few weeks I worked on a Data Science mini project related to Machine Learning. After thinking for a long time, I finally decided to build a Machine Learning model that can predict whether an Instagram influencer's engagement will grow or decline in the following month.

This is an end-to-end mini project, and this article is divided into 4 sections:

  1. Retrieving data from Instagram influencers using Selenium and Beautiful Soup.
  2. Preprocessing the data, from data cleansing through feature engineering and feature selection, until it is ready to be consumed by the Machine Learning model.
  3. Modeling with Machine Learning algorithms (Linear Regression, Random Forest, XGBoost), including hyperparameter tuning.
  4. Interpreting the results predicted by the Machine Learning model.

After reading this article, I hope you will have gained some knowledge about acquiring external data, preprocessing data, and building a Machine Learning model, and that you will be ready to start your own mini Machine Learning project!

To follow this tutorial, you should at least know about:
1. Basic programming in Python.
2. Pandas and Numpy libraries for data analysis tools.
3. Matplotlib and Seaborn libraries for data visualization.
4. Scikit-Learn Library for Machine Learning.
5. Selenium and Beautiful Soup libraries for acquiring Instagram data.
6. Jupyter Notebook.

The full dataset and code can be downloaded from my GitHub; all work is done in a Jupyter Notebook.

1 Acquiring data from Instagram influencers using Selenium and Beautiful Soup.

Step 1 is the most time-consuming step, because retrieving influencer data from Instagram takes a long time. It consists of 3 stages:

1.1 Take the list of influencers whose engagement will be predicted. I took the top 1000 influencers from Indonesia (source: starngage).

#Imports needed for scraping
import requests
from bs4 import BeautifulSoup

#Create Empty Lists
ranking = []
username = []
category = []

#Function to scrape username information from one ranking page
def scrape_username(url):

    #accessing and parsing the input url
    response = requests.get(url)
    print(f'page {a} response {response}')  #'a' is the page counter set by the loop below
    soup = BeautifulSoup(response.content, 'html.parser')
    list_username = soup.find_all('tr')

    #looping over the table rows that we want to scrape
    for p in list_username:
        try:
            #getting the information (rank, name, and categories)
            rank = p.find('td', 'align-middle').get_text().strip()
            ranking.append(rank)
            name = p.find('a').get_text().strip()
            username.append(name)
            cat = p.find_all('span', 'badge badge-pill badge-light samll text-muted')
            category_2 = []
            for c in cat:
                d = c.find('a', 'link').get_text()
                category_2.append(d)
            category.append(category_2)
        except AttributeError:
            continue
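The function logs progress using a page counter `a`, so it is meant to be driven by an outer loop over the ranking pages. A minimal sketch of that loop, assuming starngage paginates its Indonesia ranking with a `?page=` query parameter (the exact URL pattern may differ):

#Hypothetical driver loop over the ranking pages; the URL pattern is an assumption
for a in range(1, 21):
    scrape_username(f'https://starngage.com/app/global/influencer/ranking/indonesia?page={a}')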

Sneak Peek: Output of Step 1.1

1.2 Take post links from every influencer on Instagram using Selenium.

#Imports needed for browser automation
import time
from selenium import webdriver

#Create Empty Lists
link = []
names = []

#Function to get Post Links from an influencer profile
def get_influencer_link(username):
    #influencer profile url
    url = f'https://www.instagram.com/{username}/'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)
    i = 0
    while i < 8:
        try:
            #get all anchor tags and keep the ones that point to posts
            pages = driver.find_elements_by_tag_name('a')
            for data in pages:
                data_2 = data.get_attribute("href")
                if '/p/' in data_2:
                    link.append(data_2)
                    names.append(username)
            #scroll down to the bottom and wait for the page to load
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            i += 1
        except Exception:
            i += 1
            continue
    driver.quit()
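Since the function appends into the shared `link` and `names` lists, a simple loop over the usernames scraped in Step 1.1 is enough to collect everything (a sketch; rate limiting and login handling are omitted):

#Collect post links for every influencer scraped in Step 1.1
for u in username:
    get_influencer_link(u)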

Sneak Peek: Output of Step 1.2

1.3 Retrieve information from each post, such as the number of likes, number of comments, and captions, using Beautiful Soup.

#Imports needed for parsing post pages
import json
import requests
from bs4 import BeautifulSoup

#Create Empty Lists
likes = []
comment_counts = []
dates = []
captions = []
type_posts = []
links = []

#Counters for progress and error tracking
i = 0
n = 0

#Function to get information from a single post link
def get_information(link):
    global i, n
    try:
        #accessing and parsing the post url
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'html.parser')

        #the post data lives in a JSON blob inside a <script> tag
        body = soup.find('body')
        script = body.find('script')
        raw = script.text.strip().replace('window._sharedData =', '').replace(';', '')
        json_data = json.loads(raw)
        posts = json_data['entry_data']['PostPage'][0]['graphql']

        #acquiring information
        like = posts['shortcode_media']['edge_media_preview_like']['count']
        comment_count = posts['shortcode_media']['edge_media_to_parent_comment']['count']
        date = posts['shortcode_media']['taken_at_timestamp']
        caption = posts['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
        type_post = posts['shortcode_media']['__typename']
        likes.append(like)
        comment_counts.append(comment_count)
        dates.append(date)
        captions.append(caption)
        type_posts.append(type_post)
        links.append(link)
        i += 1
    except Exception:
        i += 1
        n += 1
        print(f'number of link errors {n} at iteration {i}')
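Running the parser over every link from Step 1.2 and assembling the lists into a DataFrame might look like this (the column names are my own assumptions, except `dates` and `captions`, which are used in Step 2.1 below):

import pandas as pd

#Run the parser over every post link collected in Step 1.2
for post_link in link:
    get_information(post_link)

#Assemble the scraped fields into one DataFrame (column names are assumptions)
df = pd.DataFrame({
    'links': links,
    'likes': likes,
    'comment_counts': comment_counts,
    'dates': dates,
    'captions': captions,
    'type_posts': type_posts,
})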

Sneak Peek: Output of Step 1.3

2 Preprocessing the data, from data cleansing through feature engineering and feature selection, until it is ready to be consumed by the Machine Learning model.

Step 2 consists of 3 stages, namely Data Cleansing, Feature Engineering, and Feature Selection.

2.1 Data Cleansing consists of 2 things: converting the date feature, which is still in epoch time, to datetime, and cleansing the captions feature.

import datetime as dt

#convert epoch time --> datetime
#The format is year-month-day-hour
df['dates'] = df['dates'].apply(lambda x: dt.datetime.fromtimestamp(x).strftime('%Y-%m-%d-%H'))

#remove unused characters in the captions feature
df['captions'] = df['captions'].replace(r'[\n]', '', regex=True)

#fill missing values in the captions feature
df['captions'] = df['captions'].fillna('no captions')

Sneak Peek: Output of Step 2.1

2.2 This step is Feature Engineering, where the 10 original features are expanded into 52 new features. This stage also builds the base table for the modeling process.

#create lag features of n_post (last 3 months)
#number of posts 1 month ago
base_table['n_post_01'] = base_table.groupby(['username'])['n_post'].shift(1).fillna(0)

#number of posts 2 months ago
base_table['n_post_02'] = base_table.groupby(['username'])['n_post'].shift(2).fillna(0)

#number of posts 3 months ago
base_table['n_post_03'] = base_table.groupby(['username'])['n_post'].shift(3).fillna(0)

Sneak Peak Output of Step 2.2
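The article does not show how the modeling target is defined. As a hedged sketch, assuming average engagement per month is derived from likes and comments (`total_likes` and `total_comments` are hypothetical column names), a numeric next-month target consistent with the RMSE metric used later could look like this:

#Hypothetical: average engagement per influencer-month
#('total_likes' and 'total_comments' are assumed column names)
base_table['avg_engagement'] = ((base_table['total_likes'] + base_table['total_comments'])
                                / base_table['n_post'])

#Numeric regression target: next month's average engagement
#(growth vs. decline is derived from the prediction afterwards)
base_table['target'] = base_table.groupby('username')['avg_engagement'].shift(-1)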

2.3 The next step is Feature Selection. I am still using a simple approach: looking at the correlation coefficient between each predictor and the target feature.

import matplotlib.pyplot as plt
import seaborn as sns

#I choose variables with a correlation coefficient of r < -0.2 or r > 0.3
#(the threshold is a subjective choice)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr());

Heatmap Correlation

From a total of 62 features, I kept only about 20 to use in the modeling process.
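Applying those thresholds programmatically might look like this (a sketch, assuming the target column is named `target`):

#Correlation of every feature with the target ('target' is an assumed name)
corr = df.corr()['target'].drop('target')

#Keep features with r < -0.2 or r > 0.3, per the thresholds above
selected = corr[(corr < -0.2) | (corr > 0.3)].index.tolist()
print(len(selected), selected)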

3 Modeling with Machine Learning algorithms (Linear Regression, Random Forest, XGBoost), including hyperparameter tuning.

I ran a number of scenarios in the modeling process in order to find the best model. The scenarios are:
1. Modeling without Feature Selection and without hyperparameter tuning.
2. Modeling without Feature Selection and with hyperparameter tuning.
3. Modeling with Feature Selection and without hyperparameter tuning.
4. Modeling with Feature Selection and with hyperparameter tuning.

Output of the Modeling Process

There are 10 scenarios in total, each with its performance on the training data, the test data, and the combined training+test data.

The only evaluation metric I used is Root Mean Squared Error (RMSE). Across all scenarios, Random Forest with Feature Selection and with hyperparameter tuning gives the best results in terms of RMSE on train, RMSE on test, and RMSE on all the data, so I chose this model as the final model for making predictions.
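A minimal sketch of that winning setup, assuming the ~20 selected features from Step 2.3 and a numeric target column named `target` (the article's exact split and parameter grid are not shown):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

#Split on the selected features and the numeric target (names assumed)
X_train, X_test, y_train, y_test = train_test_split(
    df[selected], df['target'], test_size=0.2, random_state=42)

#Hypothetical parameter grid; the article's grid is not shown
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)

#Evaluate with RMSE on train and test, as in the article
rmse_train = np.sqrt(mean_squared_error(y_train, search.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(search.best_params_, rmse_train, rmse_test)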

4 Interpreting the results predicted by the final Machine Learning model.

The prediction results show that in July, the average total engagement of 513 Instagram influencers will grow, while that of 123 will decline.

Predicted Category Composition
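Since the final model is a regressor, the growth/decline split is derived by comparing each influencer's predicted engagement against the current month. A hedged sketch (`X_july` and `current_engagement` are assumed names, not from the original code):

#Predict July's average engagement (X_july: feature rows for July, assumed)
pred = search.best_estimator_.predict(X_july)

#Compare against the current month's engagement to get the split
growing = (pred > current_engagement).sum()
declining = (pred <= current_engagement).sum()
print(f'growing: {growing}, declining: {declining}')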

If we look at the feature importances of our best model, Random Forest, we can see that the features related to likes and engagement have high relative importance.

Feature Importance
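Such a chart can be produced from the fitted model, for example (a sketch reusing the names from the tuning snippet above):

import pandas as pd

#Feature importances of the tuned Random Forest
best_rf = search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=selected).sort_values()
importances.plot(kind='barh', figsize=(8, 6));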

That completes this article on predicting Instagram influencer engagement. To conclude, we have done a number of things:

1. Acquiring external data from Instagram.
2. Doing some data preprocessing.
3. Modeling with Machine Learning algorithms.
4. Interpreting the results.
