Before we start

Do not forget to follow me on my GitHub, LinkedIn and Medium accounts.

I love to write about Data Science and to share cool stuff with people on the internet.

Easing your EDA

How about having an easier way to start your Exploratory Data Analysis (EDA) and make data reports that give you great insights? Sounds nice, eh?

With Pandas Profiling, that is possible.

You might be asking yourself what Pandas Profiling is. No, it’s not a bunch of Chinese pandas computing data.

Pandas Profiling is an open-source Python library that lets you do your EDA very quickly.

By the way, it also generates an interactive HTML report, which you can show to anyone. Imagine going to your boss, who doesn’t code, with an interactive description of the company’s data. Great for your branding, right?

These are some of the things you get in your report:

Type inference: detect the types of columns in a DataFrame.
Essentials: type, unique values, missing values.
Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range.
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
Most frequent values.
Histogram.
Correlations: highlighting of highly correlated variables, plus Spearman, Pearson and Kendall matrices.
Missing values: matrix, count, heatmap and dendrogram of missing values.
Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
File and image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
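To get a feel for what some of these items mean, here is a rough sketch of a few of them computed by hand with plain pandas. This is not how Pandas Profiling works internally; the tiny DataFrame and its column names are made up purely for illustration.

```python
import pandas as pd

# Tiny made-up frame, used only to illustrate a few statistics from the list above.
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0, 35.0, None],
        "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
        "sex": ["male", "female", "female", "female", "male"],
    }
)

# Essentials: inferred type, number of distinct values, number of missing values.
essentials = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "n_missing": df.isna().sum(),
    }
)

# Quantile statistics for one numeric column.
fare = df["fare"]
q1, median, q3 = fare.quantile([0.25, 0.5, 0.75])
quantile_stats = {
    "min": fare.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": fare.max(),
    "range": fare.max() - fare.min(),
    "iqr": q3 - q1,
}

# Descriptive statistics for the same column.
descriptive_stats = {
    "mean": fare.mean(),
    "std": fare.std(),
    "sum": fare.sum(),
    "cv": fare.std() / fare.mean(),  # coefficient of variation
    "skewness": fare.skew(),
    "kurtosis": fare.kurt(),
}

# Most frequent values of a categorical column.
top_values = df["sex"].value_counts()

# Correlation matrices between the numeric columns.
numeric = df[["age", "fare"]]
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(essentials)
print(quantile_stats)
print(top_values)
```

The report gives you all of this (and much more) for every column at once, which is exactly the point of the library.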

Given this, let’s get going.

First of all, you need to install the package.

# Install Pandas Profiling (the leading "!" runs the command from a notebook cell)
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip -q

Now, let’s import both pandas and pandas_profiling.

# Import the modules
import pandas as pd
from pandas_profiling import ProfileReport

We will be using the Titanic dataset for our analysis. Let’s load it:

# Load the Titanic dataset into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

After you load it, you should always take a look at your dataset. Then simply pass it to ProfileReport:

report = ProfileReport(df)
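The "take a look at your dataset" step can be as small as a few pandas calls. Here is a minimal sketch: quick_look is a hypothetical helper name (not part of pandas or Pandas Profiling), and the toy frame only stands in for the Titanic df loaded earlier.

```python
import pandas as pd


def quick_look(frame: pd.DataFrame, n_rows: int = 5) -> dict:
    """Collect a few basic facts about a DataFrame before profiling it."""
    return {
        "shape": frame.shape,  # (rows, columns)
        "columns": list(frame.columns),
        "missing_per_column": frame.isna().sum().to_dict(),
        "head": frame.head(n_rows),  # first rows, to eyeball the values
    }


# Toy stand-in data; in the post you would call quick_look(df) instead.
toy = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, None, 26.0]})
summary = quick_look(toy, n_rows=2)

print(summary["shape"])
print(summary["missing_per_column"])
print(summary["head"])
```

If something already looks off here (wrong separator, unexpected columns), it is cheaper to fix it before generating the full report.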

Now you simply have to “tell” Pandas Profiling to render the report for your dataset:

report.to_notebook_iframe()

There you go. As simple as that. You can check the result here.

If you use a Jupyter Notebook, your report is embedded in it. However, you may want to use it in other places.

Pandas Profiling also allows you to do that. Just type this to save your report as an HTML file:

report.to_file('file_name.html')

If you want the HTML source “code” as a string (don’t kill me for calling it code), which is rare but possible, just type:

report.to_html()

You can even save it as a JSON file:

# As a string
json_data = report.to_json()

# As a file
report.to_file("your_report.json")
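If you want both the HTML and the JSON version in one go, a small wrapper around to_file does the job. This is only a sketch: save_report is a hypothetical helper name, and it relies solely on the to_file method shown above, passing a file name whose extension matches the format you want.

```python
from pathlib import Path


def save_report(report, stem, formats=("html", "json"), out_dir="."):
    """Write `report` once per requested format and return the paths used.

    `report` is expected to be a ProfileReport (anything with a to_file method).
    """
    paths = []
    for fmt in formats:
        # Spell out the extension so the output format is explicit.
        path = Path(out_dir) / f"{stem}.{fmt}"
        report.to_file(str(path))
        paths.append(path)
    return paths


# Usage, assuming `report` was created as shown earlier in the post:
# paths = save_report(report, "titanic_report")
```

Keeping the extension explicit also makes it obvious later which file is which.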

Conclusion

Today you learned the basics (it doesn’t get much more complex than that) of Pandas Profiling, a simple yet powerful tool.

In your report you will have the following sections:

Overview.
Variables.
Correlations.
Missing Values.
Sample.

With four lines of code, you can have this beautiful report.

If I were you, it would totally be on my tool list for my Data Analysis routine. It just makes your work much more dynamic.

It can even save you a couple of hours.

Not to mention that the report is beautiful, minimalist and interactive, which makes it easy to understand for anyone who takes a look at it.

By the way, you can also customize the report, but that’s something for another post. :)

Reference:

https://github.com/pandas-profiling/pandas-profiling