Nitty-Gritty of Advanced Indexing in Pandas
Why Multi-level Indexing is not as daunting as it seems
There’s this beautiful quote I hear oftentimes,
“Sometimes we don’t see what is in front of us”
That’s so true with everything — be it the cause of your bug or the bug itself, misspelled variable names, not setting inplace=True and wondering why my DataFrame doesn’t change and quite a lot. Not just bugs or typos, even functionalities. We don’t see the basic functionalities as the building blocks of a product. One such thing which is overlooked most of the time but plays a crucial role in Data Analysis and Manipulation is Index in Pandas.
Indexes, Series, and DataFrames are the core Pandas data structures. Indexes are the identifiers or the address to a location in a DataFrame. While manipulating one-dimensional and two-dimensional indexes are the most common practice. Learning to work with an arbitrary number of dimensions can be very handy at times. Luckily, Pandas enables the use and storage of higher-dimensional data into one-dimensional data structures like series and two-dimensional data structures like DataFrames. This is very well known as the Hierarchical indexing, Advanced Indexing, or Multi-level indexing.
In this post, I’ll explain the basic how-tos of Multi-level Indexing in real-world data.
To start off, I have obtained this dataset from Kaggle. The dataset contains some sample data about cars and looks like the one below.
The Tedious Way:
Let’s say I want to know if there are any Toyota cars with its body style as the sedan and its fuel type as diesel.
This is one of the ways through which we can find out the desired outcome. What if we would like to know the make of all the diesel cars with its body style as the wagon and check if its price falls within $20,000?
We still obtain the desired output but the code is not quite neat. As the query gets more complicated, the code gets more monotonous and lengthy.
Advanced Indexing or Hierarchical Indexing:
Hierarchical Indexing can help us work with an arbitrary number of dimensions. It can help us in filtering, aggregating, organizing, manipulating data for really powerful data analysis.
1) Manipulating Indexes:
Let’s begin by setting indexes for the DataFrame. To create multi-indexed DataFrame we pass a list of values to the
index attribute instead of a string. There are so many categorical features in my dataset, but I’ll consider
make first as I would like to know the details of the different sets of cars. Also, I’ll consider
fuel_type for better organization.
Let’s take a look at the indexes and what they’ve got for us.
The multi-index contains three different attributes namely levels, codes, names.
levelsattribute refers to the label names of the indexes. The indexes are ordered from the left. In this case,
makeis the highest index with its
level = 0. It can also be referred to as
level = 'make'
codesattribute refers to the labels for each data point that encode the levels.
namesattribute contains the string names for the levels.
There can be multi-level indexes for both rows and columns. We can reset the indexes by simply passing the
reset_index() function to the DataFrame.
A caveat when working with multi-level indexing is sorting the indexes before attempting to perform any slicing operation. Slicing operations will deliberately throw a warning like the one below.
We can avoid this by, passing the
sort_index() function to the multi-indexed DataFrame. Also, Pandas depends on the index being sorted for optimal retrieval and search operations. We can check if the indexes are sorted using the
index.is_lexsorted() function which returns a boolean value.
Other useful functions are
swap_level() can be used to interchange the levels within the DataFrame. It takes two parameters
j which are the level names passed as strings. Passing this function doesn’t alter the order of the values.
unstack() functions take an additional parameter where you can choose the level to be stacked or unstacked, where the stacked level becomes the new lowest level (rightmost).
You can go ahead and check the official Pandas documentation for more index modifications.
2) Accessing Data:
Let's see if we can find if there are any Toyota cars with its body style as the sedan and its fuel type as diesel using our Multi-Indexed DataFrame.
.loc is primarily used for label based selection where the label names are passed inside a tuple. Sometimes it is combined with
pd.IndexSlice. An example would be finding all diesel cars.
pd.IndexSlice the first
: selects cars of all
make and the second
: selects cars of all
body_style whereas the outer
: is to select all columns.
.xs fetches a particular cross-section of a DataFrame where the axis parameter defaults to zero
axis=0 and also takes a
level parameter which is an indication of which levels to be used.
.query uses boolean expressions for filtering or slicing. Notice the use of
or inside the query over
|. This is because the expressions inside the query method are evaluated dynamically using the
eval() function. Also, this method uses modified Python syntax by default.
get_level_values uses a vector of label values for the requested level. Here, the level values can either be passed as integers or strings.
xs are for label based selection while
get_level_values are helpful in generating conditional expressions for filtering. Not all of these methods work equally, each works best under different circumstances. It is upon us to select the best method of retrieval!
3) Aggregating Data:
All the built-in data aggregation methods such as
sum() can be applied to hierarchical DataFrame. It is passed along with
level parameter that influences which part of the data the aggregate is computed on. A simple example would be to find the mean price of all Volkswagen cars.
Working with Pandas Indexes has always been fun and curious! Feel free to check my GitHub repo here that contains all the code mentioned in the article.
Thanks for reading all the way down here. Let me know in the comment section if you have any concerns, feedback, or criticism. Have a good day!