Introduction

In this article, I’d like to give you a quick overview of the most commonly used pandas techniques for manipulating data. With them you can handle many everyday tasks: selecting specific data, renaming columns, identifying missing values, and grouping or sorting your data to understand it better and discover meaningful insights.

After this overview, you will have the fundamentals needed to manipulate most kinds of tabular data, and you can always come back to this article for a quick review of pandas.

So let’s get started!

Part I: Summary functions, Renaming and Selecting

For this quick review, I will use a dataset that contains 500 books sold by Amazon from 2009 to 2019. You can find the dataset on Kaggle.

import pandas as pd
books = pd.read_csv("bestsellers.csv")
books.head()

out:

Imagine we want summary statistics for the numerical columns. The quickest way to get them is describe(), which reports the count, mean, standard deviation, minimum, quartiles and maximum of each column.

books[['User Rating', 'Reviews', 'Price']].describe()

out:
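
Since the output above is not reproduced here, this is a minimal sketch of what describe() returns. The toy DataFrame and its values are made up for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "User Rating": [4.7, 4.6, 4.8, 4.4],
    "Reviews": [100, 200, 300, 400],
    "Price": [8, 22, 15, 6],
})

summary = toy.describe()

# describe() returns a DataFrame indexed by statistic name,
# so a single statistic can be looked up by label.
print(summary.index.tolist())
print(summary.loc["mean", "Reviews"])  # 250.0
print(summary.loc["max", "Price"])     # 22.0
```

Because the result is an ordinary DataFrame, you can slice it, round it, or export it like any other table.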

To list the distinct values in a specific column, with each value shown once and no repeats, use unique().

books.Author.unique()

out:
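
Two close relatives of unique() are often useful alongside it. The sketch below uses a small made-up Series (the author names are placeholders) to show how unique(), nunique() and value_counts() differ.

```python
import pandas as pd

# Placeholder author names, invented for this example.
authors = pd.Series(["A", "B", "A", "C", "A"], name="Author")

distinct = authors.unique()      # distinct values, in order of first appearance
n_distinct = authors.nunique()   # how many distinct values there are
counts = authors.value_counts()  # how often each value occurs, most frequent first

print(list(distinct))  # ['A', 'B', 'C']
print(n_distinct)      # 3
print(counts["A"])     # 3
```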

Sometimes a column name contains a space, like the column User Rating in our dataset. This can cause problems when you manipulate your data: for example, attribute-style access such as books.Author does not work for a name with a space in it. So I suggest you rename the column without the space, as you can see below.

books = books.rename(columns={'User Rating': 'User_Rating'})
books.head()

out:
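
If several columns contain spaces, renaming them one by one gets tedious. As an alternative, here is a small sketch that replaces the spaces in every column name at once; the toy DataFrame is made up for illustration.

```python
import pandas as pd

# Toy frame with one column name containing a space (values are invented).
toy = pd.DataFrame({"User Rating": [4.7], "Name": ["Book A"], "Price": [8]})

# Replace spaces with underscores in all column names in one step.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

print(list(toy.columns))    # ['User_Rating', 'Name', 'Price']
print(toy.User_Rating[0])   # attribute access now works: 4.7
```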

Now suppose we want to identify all fiction books that have a user rating higher than or equal to four. Each condition goes in its own parentheses, and the two are combined with &.

books.loc[(books.User_Rating >= 4) & (books.Genre == "Fiction")]

out:

We could also select all books in this dataset that are written by either Stephen King or George Orwell.

books.loc[books.Author.isin(['Stephen King', 'George Orwell'])]

out:
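
To make the two selections easy to try without the CSV file, here is a self-contained sketch on a made-up table (the book names, ratings and the "Jane Doe" author are invented). It also shows query() as an alternative way to write the first filter.

```python
import pandas as pd

# Invented rows that mimic the columns used above.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D"],
    "Author": ["Stephen King", "George Orwell", "Jane Doe", "Stephen King"],
    "User_Rating": [4.5, 4.7, 3.9, 3.8],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Fiction"],
})

# Boolean masks: each condition in parentheses, combined with & (and) or | (or).
high_fiction = toy.loc[(toy.User_Rating >= 4) & (toy.Genre == "Fiction")]

# isin() keeps rows whose value matches any element of the list.
by_author = toy.loc[toy.Author.isin(["Stephen King", "George Orwell"])]

# The same first filter expressed with query().
same_with_query = toy.query("User_Rating >= 4 and Genre == 'Fiction'")

print(list(high_fiction.Name))  # ['Book A', 'Book B']
print(list(by_author.Name))     # ['Book A', 'Book B', 'Book D']
```

The parentheses around each condition matter, because & binds more tightly than comparison operators such as >= in Python.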

Part II: Missing values, Grouping and Sorting

One of the first steps to take before building a model with your dataset is to check the data types and identify the missing values.

books.dtypes

out:

books.isnull().sum()

out:

As you can see above, every column has a count of zero, so we don’t have missing values in this dataset.
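
Since this dataset has no gaps, here is a small sketch of what the same check looks like when values are missing, plus two common ways to handle them. The toy data and the choice of the median and "Unknown" as fill values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Invented rows with one missing price and one missing genre.
toy = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Price": [8.0, np.nan, 15.0],
    "Genre": ["Fiction", None, "Fiction"],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column["Price"])  # 1

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill the gaps, here with the median price and a placeholder label.
filled = toy.fillna({"Price": toy.Price.median(), "Genre": "Unknown"})
print(filled.loc[1, "Price"])       # 11.5
```

Which option is appropriate depends on the data and on the model you plan to build.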

Next, we want to group the data by author and book name, so that each author is listed together with the titles they published, and count how many times each book appears in the dataset. Here I show only the last 15 rows of the result.

bk = books.groupby(['Author', 'Name']).Price.agg(['count']).tail(15)
bk

out:

We can then sort the result by author name in descending order with the sort_values() method.

bk.sort_values(by='Author', ascending=False)

out:
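
Here is a self-contained sketch of the group, count and sort steps on made-up data (the authors and titles are invented). As an assumption for this example, I use size() with reset_index() instead of Price.agg(['count']): it counts rows per group and turns Author and Name back into ordinary columns, which makes sorting by any of them straightforward.

```python
import pandas as pd

# Invented rows; "Book X" appears twice on purpose.
toy = pd.DataFrame({
    "Author": ["B. Writer", "A. Author", "B. Writer", "A. Author", "B. Writer"],
    "Name": ["Book X", "Book Y", "Book X", "Book Z", "Book W"],
    "Price": [10, 12, 10, 9, 7],
})

# One row per (Author, Name) pair with the number of occurrences.
counts = toy.groupby(["Author", "Name"]).size().reset_index(name="count")

# Sort by author name, Z to A.
by_author_desc = counts.sort_values(by="Author", ascending=False)

# Sort by count (largest first), breaking ties alphabetically by title.
by_count_desc = counts.sort_values(by=["count", "Name"], ascending=[False, True])

print(by_author_desc.iloc[0]["Author"])  # B. Writer
print(by_count_desc.iloc[0]["Name"])     # Book X
```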

Part III: Combining Data

For this part, I take another dataset from Kaggle. I renamed and dropped some of its columns so that it is easier to combine with the previous one.

new_books.head(10)

out:

all_books = pd.concat([books, new_books])
all_books

With this simple concatenation, the rows of the new dataset are appended to those of the old one. There are also other techniques, such as merge and join, that combine two datasets based on a specific common key.
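
To make the difference concrete, here is a minimal sketch of both approaches. All three tables are made up for illustration; in particular the ratings table and the choice of Name as the common key are assumptions, not part of the Kaggle datasets used above.

```python
import pandas as pd

# Two invented tables with the same columns.
books = pd.DataFrame({"Name": ["Book A", "Book B"], "Author": ["Author 1", "Author 2"]})
new_books = pd.DataFrame({"Name": ["Book C"], "Author": ["Author 3"]})

# concat stacks rows; ignore_index=True gives the result a fresh 0..n-1 index
# instead of repeating the original index labels.
all_books = pd.concat([books, new_books], ignore_index=True)

# A hypothetical second table that shares the "Name" column.
ratings = pd.DataFrame({"Name": ["Book A", "Book C"], "User_Rating": [4.7, 4.2]})

# merge matches rows on the common key; how="left" keeps every row of all_books
# and leaves NaN where no rating is found.
merged = all_books.merge(ratings, on="Name", how="left")

print(len(all_books))  # 3
print(merged)
```

In short, concat adds rows (or columns) without looking at the values, while merge lines rows up by the key you choose.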

Conclusion

Now you have an overview of the most commonly used techniques for manipulating a dataset, and if you forget how to write a specific function, you can come back to this article.