Quit Pandas❌, Use👥 Dask Data frames Instead🤯

In this post, we are going to see how the Dask data frame can be used in place of the Pandas data frame.


Ravi kumar

10 months ago | 2 min read

In the last couple of posts, we have seen multiple ways to optimize our Pandas operations.

  • One way is to use “Pandarallel”, which enables parallel processing on Pandas data frames:
https://blog.devgenius.io/optimizing-pandas-data-frame-operations-using-pandarallel-8824a599123
  • Another way is to use the “modin.pandas” library in place of the “pandas” library:
https://blog.devgenius.io/speed-up-pandas-operation-4x-times-just-by-a-single-line-of-code-21e9195d50d7

In this post, we will compare Dask and Pandas data frames on a few common operations to make it clear how parallel processing can boost performance.

Table of Contents:

  • What is Dask and how does it work?
  • Pandas Vs Dask: Loading a 300+MB file
  • Pandas Vs Dask: Group by operation
  • Pandas Vs Dask: Write back to CSV
  • Conclusion

What is Dask and how does it work?

Dask is an open-source library for parallel and distributed computing in Python. It lets users make use of all their CPU cores and memory without writing complex parallel algorithms or keeping redundant copies of data. Dask is often used to process datasets that don’t fit into memory, through familiar Pandas-like APIs such as dask.dataframe.

Dask works by breaking up large computations into smaller chunks that can be executed in parallel. It then schedules and coordinates these smaller tasks across multiple threads or processors. Dask also uses memory-efficient data structures, such as blocked arrays, to handle large datasets that do not fit into memory. This allows it to perform complex computations on larger-than-memory datasets using familiar APIs from libraries like NumPy and Pandas.
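
The chunk-and-combine idea described above can be sketched with the Python standard library alone. This is only an illustration of the principle, not how Dask is implemented; the helper names split_into_chunks and chunked_sum are hypothetical and chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_sum(values, chunk_size=4, workers=2):
    """Sum each chunk in a worker thread, then combine the partial sums."""
    chunks = list(split_into_chunks(values, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is an independent task, so the tasks can run side by side.
        partial_sums = list(pool.map(sum, chunks))
    # The final step combines the small results into one answer.
    return sum(partial_sums)


total = chunked_sum(list(range(10)))
```

A Dask data frame applies the same pattern to partitions of a table: each partition is processed as a separate task and the partial results are combined at the end.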

Pandas Vs Dask: Loading a 300+MB file

  • Pandas:

%%time

import pandas as pd ## Importing module

data = pd.read_csv("a.csv") ## Reading the CSV file

Output time: Pandas to load a 300+ MB file

  • Dask:

%%time

import dask.dataframe as dd ## Importing module

data = dd.read_csv("a.csv") ## Reading the CSV file

Output time: Dask to load a 300+ MB file

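NOTE: In this test, the Dask call returns almost 5x faster than Pandas. Keep in mind that dd.read_csv is lazy: it reads only a small sample of the file to infer column names and types, and the full read happens later, when an operation such as .compute() triggers it. The two timings are therefore not a like-for-like comparison of load time.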

Pandas Vs Dask: Group by operation

  • Pandas:

%%time

data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'})

Output time: Pandas Group By operation

  • Dask:

%%time

data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'}).compute()

Output time: Dask Group By operation

Pandas Vs Dask: Write back to CSV

  • Pandas:

%%time

data.to_csv('Pandas_csv.csv', index = False)

Output time: Pandas to_csv operation

  • Dask:

%%time

data.to_csv('Dask_file_csv.csv', index = False) ## Saves one file per partition

Output time: Dask to_csv operation (Different file)

With this method, however, the data is not written to a single CSV. Dask writes one part file per partition, like below:

File_name.part

To save all the data in one file, pass single_file = True:

%%time

data.to_csv('Dask-csv.csv', index = False, single_file = True) ## save in one file

Output time: Dask to_csv operation (One file)

Conclusion

In some cases, such as loading and merging data, Dask can be considerably faster than Pandas, but for aggregation and sorting on data that fits in memory, Pandas is hard to beat.

More about me:

I am a Data Science enthusiast 🌺, learning and exploring how Math, Business, and Technology can help us make better decisions in the field of data science.

If this article helped you, don’t forget to follow, like, and share it with your friends 👍 Happy Learning!
