Importance of Scaling Your Data, with an Example: Log Transformations

Data in the real world is constantly changing; the scale of a feature can shift dramatically and quietly break your ML model. It is therefore always better to scale your data. This blog walks through an interesting example (for Python users).


Harshit Sati

2 years ago | 1 min read

Fun fact: all Euclidean-distance-based algorithms (e.g. k-NN, k-means) are positively affected by scaled data.
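To see why, here is a minimal sketch (synthetic numbers, not from the Yelp data) of how an unscaled feature can dominate a Euclidean distance:

```python
import numpy as np

# Two hypothetical points: feature 1 lies in [0, 1], feature 2 in [0, 10000].
a = np.array([0.1, 2000.0])
b = np.array([0.9, 2500.0])

# Unscaled: feature 2's range swamps feature 1 entirely.
d_raw = np.linalg.norm(a - b)

# Min-max scaled to [0, 1] using the assumed ranges above.
a_s = np.array([0.1, 2000.0 / 10000])
b_s = np.array([0.9, 2500.0 / 10000])
d_scaled = np.linalg.norm(a_s - b_s)

print(round(d_raw, 2))     # 500.0 -- driven almost entirely by feature 2
print(round(d_scaled, 2))  # 0.8  -- both features now contribute
```

After scaling, the 0.8 gap in feature 1 actually matters to the distance instead of being rounding noise.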

Photo by Tingey Injury Law Firm on Unsplash

Yelp Dataset

The Yelp dataset released for the academic challenge contains information on 11,537 businesses, along with 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses.

In this case, for simplicity, we will use the “review_count” column to predict the number of stars awarded to a business.

import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

biz_file = open('Datasets/yelp_academic_dataset_business.json', encoding="utf-8")
biz_df = pd.DataFrame([json.loads(x) for x in biz_file.readlines()])

Unscaled Feature

When the untouched feature is fed into the model:

fig, ax = plt.subplots(2, 1)
biz_df['review_count'].hist(ax=ax[0], bins=100)
ax[0].tick_params(labelsize=14)
ax[0].set_xlabel("review counts", fontsize=14)
ax[0].set_ylabel("occurrence", fontsize=14)

Some businesses have thousands of reviews, while small businesses with far fewer reviews can still earn higher star ratings. The raw distribution is heavily right-skewed.

Scaled Feature

We need to add one so that the log function does not blow up when it receives a 0 as x (log(0) is undefined).

import numpy as np  # in case it was not imported above

# continuing the above code
biz_df["log_rc"] = np.log(biz_df["review_count"] + 1)
biz_df['log_rc'].hist(ax=ax[1], bins=100)
ax[1].tick_params(labelsize=14)
ax[1].set_xlabel("log of review count", fontsize=14)
ax[1].set_ylabel("occurrence", fontsize=14)
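As a quick sanity check (a standalone sketch, independent of the Yelp data), the +1 shift keeps the transform finite at zero, and NumPy's np.log1p computes the same quantity with better numerical behavior for very small values:

```python
import numpy as np

counts = np.array([0, 1, 10, 100, 10000])

shifted = np.log(counts + 1)  # finite everywhere: log(0 + 1) = 0
log1p = np.log1p(counts)      # built-in log(1 + x)

print(shifted[0])                   # 0.0 for a zero-review business
print(np.allclose(shifted, log1p))  # True
```

Either form compresses the long right tail: a jump from 0 to 10 reviews now counts for more than a jump from 9,990 to 10,000.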

Model Accuracy

Note that this comparison runs on a different dataset of news articles, using the token count of an article (raw vs. log-transformed) to predict its shares.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
test_score = cross_val_score(model, df[[" n_tokens_content"]], df[" shares"], cv=10)
print(f"R squared score for tokens content is {test_score.mean():.5f} +/- {test_score.std():.5f}")
test_score = cross_val_score(model, df[["log_tc"]], df[" shares"], cv=10)
print(f"R squared score for log(token content) is {test_score.mean():.5f} +/- {test_score.std():.5f}")


Hence we notice that the R² score of the model improves with a log transformation of the feature column.

The scores are negative because the number of words written is simply not a good measure of how many shares an article gets; the log transform only makes a weak feature slightly less bad.
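A negative cross-validated R² just means the model did worse on the held-out folds than always predicting the mean. A minimal synthetic sketch (random data, not the news dataset used above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.exponential(scale=100, size=(500, 1))  # feature unrelated to target
y = rng.standard_normal(500)

# cross_val_score uses R^2 by default for regressors; with an uninformative
# feature the mean score hovers near zero and is often slightly negative.
scores = cross_val_score(LinearRegression(), X, y, cv=10)
print(scores.mean())
```

So a negative score is not a bug in the transform; it is the metric telling you the feature carries almost no signal.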

Don’t forget to 👏, it would encourage me to write more! :)



Created by

Harshit Sati

Hey, I am a 3rd-year Computer Science student eager to collaborate and learn more about machine learning, AI, and their everyday applications. I am very active in various data science communities and platforms. I am a content writer and have written data science notebooks and Medium blogs on data science and AI. From learning different aspects of data science to collaborating with other data scientists on platforms like Kaggle and Udacity, and sharing my knowledge with peers through blogging on Medium, I am committed to continuous learning and growth in every aspect: collaborating, sharing, and implementing. I am currently learning MLOps. My dream is to help conserve marine mammals and our environment through data science; Big Data is key to addressing some of the biggest environmental concerns, and I believe it can help us save the environment. Follow me on: - Medium:






