Fun fact: Algorithms based on Euclidean distance generally benefit from scaled data.
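To see why scale matters for Euclidean distance, here is a minimal sketch. The three "businesses" and their values are made up for illustration, and min-max scaling is used only because it is the simplest scaler to write by hand; it is not the transformation used later in this post.

```python
import numpy as np

# Hypothetical businesses described by (review_count, stars); values are made up.
a = np.array([1000.0, 2.0])
b = np.array([1010.0, 5.0])   # similar review count to a, very different stars
c = np.array([1200.0, 2.0])   # same stars as a, more reviews

def euclidean(u, v):
    """Plain Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On the raw data the review_count column dominates the distance.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max scale each column to [0, 1] using these three points.
X = np.vstack([a, b, c])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_ab = euclidean(X_scaled[0], X_scaled[1])
scaled_ac = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(a, b) = {raw_ab:.3f}, d(a, c) = {raw_ac:.3f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, a looks much closer to b than to c simply because review counts are numerically larger than stars. After scaling, the three-star gap between a and b counts as much as the 200-review gap between a and c.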


Yelp Dataset

The Yelp dataset released for the academic challenge contains information on 11,537 businesses, along with 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses.

In this case, for simplicity, we will use the column “review_count” to predict the number of stars awarded to a business.

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Each line of the file is one JSON record describing a business
with open(r'Datasets\yelp_academic_dataset_business.json', encoding = "utf-8") as biz_file:
    biz_df = pd.DataFrame([json.loads(x) for x in biz_file])

Unscaled Feature

This is how the untouched feature is distributed before it is fed into the model:

fig, ax = plt.subplots(2,1)
fig.set_figheight(15)
fig.set_figwidth(15)
biz_df['review_count'].hist(ax = ax[0], bins = 100)
ax[0].tick_params(labelsize = 14)
ax[0].set_xlabel("review counts", fontsize = 14)
ax[0].set_ylabel("occurrence", fontsize = 14)

Some businesses have thousands of reviews, while small businesses with far fewer reviews can still have better star ratings.

Scaled Feature

We need to add one before taking the log so that the log function does not blow up on receiving 0 as x (log(0) is undefined).

# continuing the above code
biz_df["log_rc"] = np.log(biz_df["review_count"] + 1)
biz_df['log_rc'].hist(ax = ax[1], bins = 100)
ax[1].tick_params(labelsize = 14)
ax[1].set_xlabel("log of review counts", fontsize = 14)
ax[1].set_ylabel("occurrence", fontsize = 14)
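As a small aside on the add-one step, here is a sketch on made-up review counts (the `counts` array is hypothetical, not taken from the Yelp data). It shows what the plain log does at 0, and that NumPy's `np.log1p` computes the same log(x + 1) in one call.

```python
import numpy as np

counts = np.array([0, 1, 10, 100, 1000])  # made-up review counts

# Plain log of 0 is -inf (NumPy also emits a divide-by-zero warning,
# which is silenced here so the example runs quietly).
with np.errstate(divide="ignore"):
    raw_log_of_zero = np.log(np.array([0.0]))[0]

manual = np.log(counts + 1)   # the add-one trick used in this post
builtin = np.log1p(counts)    # NumPy's built-in equivalent

print(raw_log_of_zero)
print(manual)
print(builtin)
```

Either form works here; `np.log1p` is mostly a convenience, and it is also more precise for values very close to 0.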

Model Accuracy

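from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumption: `df` already holds the " n_tokens_content", "log_tc" and " shares" columns; it is not built in the code above
model = LinearRegression()
test_score = cross_val_score(model, df[[" n_tokens_content"]], df[" shares"], cv = 10)
print(f"R squared score for token content is {test_score.mean():.5f} +/- {test_score.std():.5f}")
test_score = cross_val_score(model, df[["log_tc"]], df[" shares"], cv = 10)
print(f"R squared score for log(token content) is {test_score.mean():.5f} +/- {test_score.std():.5f}")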
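For readers who want to run the same comparison on the "review_count" and "stars" columns described earlier, here is a rough sketch. It uses a small synthetic stand-in (`toy_df`, generated at random, so the names and values are assumptions and not the real Yelp data), which means the printed scores say nothing about the actual dataset. Swap in `biz_df` to try it on the real file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Yelp business table (hypothetical values).
rng = np.random.default_rng(0)
n = 500
toy_df = pd.DataFrame({
    # Heavy-tailed counts, loosely mimicking a skewed review_count column
    "review_count": rng.lognormal(mean=2.0, sigma=1.0, size=n).astype(int) + 3,
    # Star ratings in half-star steps from 1.0 to 5.0, drawn at random
    "stars": rng.choice(np.arange(1.0, 5.5, 0.5), size=n),
})
toy_df["log_rc"] = np.log(toy_df["review_count"] + 1)

model = LinearRegression()

# For regressors, cross_val_score reports R squared by default.
scores_raw = cross_val_score(model, toy_df[["review_count"]], toy_df["stars"], cv=10)
scores_log = cross_val_score(model, toy_df[["log_rc"]], toy_df["stars"], cv=10)

print(f"R squared score for review_count is {scores_raw.mean():.5f} +/- {scores_raw.std():.5f}")
print(f"R squared score for log(review_count) is {scores_log.mean():.5f} +/- {scores_log.std():.5f}")
```

Using the same model, the same target, and the same 10 folds for both runs keeps the comparison fair: the only thing that changes is whether the feature is log-transformed.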

Conclusion

Hence we notice that the R squared score of the model increased with the log transformation of the feature column.

The R squared score is negative, as the number of words written is surely not a good measure of how many shares the article might’ve had.

Don’t forget to 👏, it would encourage me to write more! :)

Github

https://github.com/HarshitSati/Feature_Engineering