Python: A Quick Review of Pandas
An overview of some useful pandas techniques
Kouate
Introduction
In this article, I’d like to give you a quick overview of the most commonly used pandas techniques for manipulating data. These techniques will help you tackle many common problems: selecting specific data, renaming columns, identifying missing values, and grouping and sorting your data so you can better understand it and discover meaningful insights.
After this overview, you will have the fundamentals needed to manipulate many kinds of data, and you can always come back to this article for a quick review of pandas.
So let’s get started!
Part I: Summary functions, Renaming and Selecting
For this quick review, I will use a dataset of 500 books sold by Amazon from 2009 to 2019. You can find the dataset on Kaggle.
import pandas as pd
books = pd.read_csv("bestsellers.csv")
books.head()

Suppose we want summary statistics for the numerical columns. A quick way to get them is describe().
books[['User Rating', 'Reviews', 'Price']].describe()

To list the distinct values in a specific column, that is, each value once with the repeats dropped, use unique().
books.Author.unique()

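Since the outputs are not reproduced here, a minimal sketch on a toy frame may help show what these two calls return. The data below is hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset
toy = pd.DataFrame({
    "Author": ["A", "B", "A", "C"],
    "User Rating": [4.5, 4.0, 4.7, 3.9],
    "Price": [10, 8, 12, 15],
})

# describe() returns one row per statistic (count, mean, std, min, quartiles, max)
stats = toy[["User Rating", "Price"]].describe()
print(stats.loc[["count", "mean", "min", "max"]])

# unique() returns each distinct value once, in order of first appearance
authors = toy["Author"].unique()
print(authors)
```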
Sometimes a column name contains a space, like User Rating in our dataset. This can cause problems when you manipulate the data; for example, attribute-style access such as books.User Rating is not valid Python. So I suggest renaming the column without the space, as you can see below.
books = books.rename(columns={'User Rating': 'User_Rating'})
books.head()

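As a small illustration of the renaming step, here is a sketch on a hypothetical two-row frame. It shows that rename() returns a new DataFrame and that attribute-style access becomes available once the space is gone.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"User Rating": [4.5, 3.9], "Genre": ["Fiction", "Non Fiction"]})

# With a space in the name, only bracket access works
ratings_before = toy["User Rating"]

# rename() returns a new DataFrame; the original is left unchanged
renamed = toy.rename(columns={"User Rating": "User_Rating"})

# After renaming, attribute-style access works too
ratings_after = renamed.User_Rating
print(list(renamed.columns))
```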
Now suppose we want to select all fiction books with a user rating greater than or equal to four.
books.loc[(books.User_Rating >= 4) & (books.Genre == "Fiction")]

We could also select all books in this dataset written by either Stephen King or George Orwell.
books.loc[books.Author.isin(['Stephen King', 'George Orwell'])]

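The two selections above can be tried on a small frame. This is a sketch with made-up rows ("Jane Doe" is a placeholder author), meant to show how the boolean masks combine.

```python
import pandas as pd

# Hypothetical toy data; "Jane Doe" is a placeholder author
toy = pd.DataFrame({
    "Author": ["Stephen King", "George Orwell", "Jane Doe", "Stephen King"],
    "User_Rating": [4.6, 4.7, 3.8, 3.9],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Fiction"],
})

# Each comparison yields a boolean Series; wrap each one in parentheses
# and combine with & (element-wise AND), not the Python keyword `and`
high_fiction = toy.loc[(toy.User_Rating >= 4) & (toy.Genre == "Fiction")]

# isin() is True where the value matches any entry in the list
by_author = toy.loc[toy.Author.isin(["Stephen King", "George Orwell"])]

print(len(high_fiction), len(by_author))
```

The parentheses matter because & binds more tightly than comparison operators such as >= and ==.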
Part II: Missing values, Grouping and Sorting
One of the first steps to take before building a model with your dataset is to identify the missing values and check the data types.
books.dtypes

books.isnull().sum()

Every column sums to zero here, so we don’t have missing values in this dataset.
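Because this dataset has no gaps, it can help to see what the same calls report when values are missing. The sketch below uses a hypothetical three-row frame with deliberately empty cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing cells
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Price": [10.0, np.nan, 12.0],
    "Genre": ["Fiction", None, None],
})

# isnull() marks missing cells as True; sum() counts them per column
missing = toy.isnull().sum()
print(missing)

# dtypes shows how pandas stored each column
print(toy.dtypes)
```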
Next, we group the rows by author and book name and count how many times each book appears in the dataset. tail(15) keeps only the last 15 groups.
# agg(['count']) already returns a DataFrame, so bk can be displayed directly
bk = books.groupby(['Author', 'Name']).Price.agg(['count']).tail(15)
bk

We can then sort the result by author name in descending order with sort_values(). Author is an index level of bk here, and sort_values() accepts index level names as sort keys.
bk.sort_values(by='Author', ascending=False)

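The grouping and sorting steps can be sketched on a toy frame where the counts are easy to check by eye. The data is hypothetical; the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data; column names follow the article's dataset
toy = pd.DataFrame({
    "Author": ["A", "A", "B", "B", "B"],
    "Name": ["x", "x", "y", "z", "y"],
    "Price": [10, 11, 8, 9, 8],
})

# One row per (Author, Name) pair; 'count' is how many rows fell in that group
counts = toy.groupby(["Author", "Name"]).Price.agg(["count"])

# Author is an index level here; sort_values accepts index level names in `by`
sorted_counts = counts.sort_values(by="Author", ascending=False)
print(sorted_counts)
```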
Part III: Combining Data
For this part, I take another dataset from Kaggle. I rename and drop some of its columns to make it easier to combine with the previous one.
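The preparation code itself is not shown, so the following is only a sketch of what such a step might look like. The frame raw_new_books and its column names (Title, Writer, Publisher) are assumptions made for illustration, not the real columns of the second dataset.

```python
import pandas as pd

# Hypothetical stand-in for the second Kaggle dataset; the real column
# names are not given in the article
raw_new_books = pd.DataFrame({
    "Title": ["Book D", "Book E"],
    "Writer": ["Author D", "Author E"],
    "Publisher": ["P1", "P2"],
})

# Rename columns to match the first dataset, then drop those with no counterpart
new_books = (
    raw_new_books
    .rename(columns={"Title": "Name", "Writer": "Author"})
    .drop(columns=["Publisher"])
)
print(list(new_books.columns))
```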
new_books.head(10)

all_books = pd.concat([books, new_books])
all_books

This simple concatenation appends the rows of the new dataset to the old one. There is also another technique, join, which merges two datasets based on a specific common key.
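To make the difference concrete, here is a minimal sketch on hypothetical data. It uses DataFrame.merge, the column-based counterpart of join (DataFrame.join matches on the index by default); the countries lookup table is invented for the example.

```python
import pandas as pd

# Hypothetical toy frames with the same columns
books_a = pd.DataFrame({"Name": ["Book A", "Book B"], "Author": ["Author A", "Author B"]})
books_b = pd.DataFrame({"Name": ["Book C"], "Author": ["Author C"]})

# concat stacks rows; ignore_index=True renumbers the index from 0
stacked = pd.concat([books_a, books_b], ignore_index=True)

# Hypothetical lookup table keyed by Author
countries = pd.DataFrame({"Author": ["Author A", "Author C"], "Country": ["US", "UK"]})

# merge matches rows on the shared key; how="left" keeps every row of `stacked`
merged = stacked.merge(countries, on="Author", how="left")
print(merged)
```

Concatenation adds rows, while a merge adds columns by matching keys. With how="left", authors missing from the lookup table get a missing value in Country.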
Conclusion
Now you have an overview of the most commonly used techniques for manipulating a dataset, and if you forget how to write a specific function, you can come back to this article.
Kouate
Data Analyst and Data Lover
