cft

Nitty-Gritty of Advanced Indexing in Pandas

Why Multi-level Indexing is not as daunting as it seems


user

Padhma Sahithya

2 years ago | 4 min read

Photo by Ozgu Ozden on Unsplash
Photo by Ozgu Ozden on Unsplash

There’s this beautiful quote I hear oftentimes,

“Sometimes we don’t see what is in front of us”

That’s so true with everything — be it the cause of your bug or the bug itself, misspelled variable names, not setting inplace=True and wondering why my DataFrame doesn’t change and quite a lot. Not just bugs or typos, even functionalities. We don’t see the basic functionalities as the building blocks of a product. One such thing which is overlooked most of the time but plays a crucial role in Data Analysis and Manipulation is Index in Pandas.

Indexes, Series, and DataFrames are the core Pandas data structures. Indexes are the identifiers or the address to a location in a DataFrame. While manipulating one-dimensional and two-dimensional indexes are the most common practice. Learning to work with an arbitrary number of dimensions can be very handy at times. Luckily, Pandas enables the use and storage of higher-dimensional data into one-dimensional data structures like series and two-dimensional data structures like DataFrames. This is very well known as the Hierarchical indexing, Advanced Indexing, or Multi-level indexing.

In this post, I’ll explain the basic how-tos of Multi-level Indexing in real-world data.

To start off, I have obtained this dataset from Kaggle. The dataset contains some sample data about cars and looks like the one below.

Sample Car Data
Sample Car Data

The Tedious Way:

Let’s say I want to know if there are any Toyota cars with its body style as the sedan and its fuel type as diesel.

This is one of the ways through which we can find out the desired outcome. What if we would like to know the make of all the diesel cars with its body style as the wagon and check if its price falls within $20,000?

We still obtain the desired output but the code is not quite neat. As the query gets more complicated, the code gets more monotonous and lengthy.

Advanced Indexing or Hierarchical Indexing:

Hierarchical Indexing can help us work with an arbitrary number of dimensions. It can help us in filtering, aggregating, organizing, manipulating data for really powerful data analysis.

1) Manipulating Indexes:

Let’s begin by setting indexes for the DataFrame. To create multi-indexed DataFrame we pass a list of values to the index attribute instead of a string. There are so many categorical features in my dataset, but I’ll consider make first as I would like to know the details of the different sets of cars. Also, I’ll consider body_style and fuel_type for better organization.

set_index
set_index

Let’s take a look at the indexes and what they’ve got for us.

multi-level index
multi-level index

The multi-index contains three different attributes namely levels, codes, names.

  • The levels attribute refers to the label names of the indexes. The indexes are ordered from the left. In this case, make is the highest index with its level = 0. It can also be referred to as level = 'make'
  • The codes attribute refers to the labels for each data point that encode the levels.
  • The names attribute contains the string names for the levels.

There can be multi-level indexes for both rows and columns. We can reset the indexes by simply passing the reset_index() function to the DataFrame.

A caveat when working with multi-level indexing is sorting the indexes before attempting to perform any slicing operation. Slicing operations will deliberately throw a warning like the one below.

warning for unsorted multi-index data frame
warning for unsorted multi-index data frame

We can avoid this by, passing the sort_index() function to the multi-indexed DataFrame. Also, Pandas depends on the index being sorted for optimal retrieval and search operations. We can check if the indexes are sorted using the index.is_lexsorted() function which returns a boolean value.

Other useful functions are

  • swap_level()
  • stack()
  • unstack()

swap_level() can be used to interchange the levels within the DataFrame. It takes two parameters i and j which are the level names passed as strings. Passing this function doesn’t alter the order of the values.

stack() and unstack() functions take an additional parameter where you can choose the level to be stacked or unstacked, where the stacked level becomes the new lowest level (rightmost).

You can go ahead and check the official Pandas documentation for more index modifications.

2) Accessing Data:

Let's see if we can find if there are any Toyota cars with its body style as the sedan and its fuel type as diesel using our Multi-Indexed DataFrame.

  • Using .loc
.loc
.loc

.loc is primarily used for label based selection where the label names are passed inside a tuple. Sometimes it is combined with pd.IndexSlice. An example would be finding all diesel cars.

pd.IndexSlice with .loc
pd.IndexSlice with .loc

Notice, inside pd.IndexSlice the first : selects cars of all make and the second : selects cars of all body_style whereas the outer : is to select all columns.

  • Using .xs
.xs
.xs

.xs fetches a particular cross-section of a DataFrame where the axis parameter defaults to zero axis=0 and also takes a level parameter which is an indication of which levels to be used.

  • Using .query
.query
.query

.query uses boolean expressions for filtering or slicing. Notice the use of and and or inside the query over & and |. This is because the expressions inside the query method are evaluated dynamically using the eval() function. Also, this method uses modified Python syntax by default.

  • Using get_level_values
get_level_values
get_level_values

get_level_values uses a vector of label values for the requested level. Here, the level values can either be passed as integers or strings.

More generally, loc and xs are for label based selection while query and get_level_values are helpful in generating conditional expressions for filtering. Not all of these methods work equally, each works best under different circumstances. It is upon us to select the best method of retrieval!

3) Aggregating Data:

All the built-in data aggregation methods such as mean(), min(), max(), sum() can be applied to hierarchical DataFrame. It is passed along with level parameter that influences which part of the data the aggregate is computed on. A simple example would be to find the mean price of all Volkswagen cars.

average price
average price

Working with Pandas Indexes has always been fun and curious! Feel free to check my GitHub repo here that contains all the code mentioned in the article.

Thanks for reading all the way down here. Let me know in the comment section if you have any concerns, feedback, or criticism. Have a good day!


Upvote


user
Created by

Padhma Sahithya


people
Post

Upvote

Downvote

Comment

Bookmark

Share


Related Articles