Quit Pandas❌, Use Dask DataFrames Instead🤯
In this post, we are going to see how the Dask data frame can be used in place of the Pandas data frame.
Ravi kumar
In the last couple of posts, we have seen multiple ways to optimize our Pandas operations.
- One way is to use “Pandarallel”, which enables parallel processing on a Pandas data frame

- Another way we discussed is to use the “modin.pandas” library in place of the “pandas” library

In this post, we will compare Dask and Pandas data frames on a few operations so that it becomes clear how parallel processing helps boost performance.
Table of Contents:
- What is Dask and how does it work?
- Pandas Vs Dask: Loading a 300+MB file
- Pandas Vs Dask: Group by operation
- Pandas Vs Dask: Write back to CSV
- Conclusion
What is Dask and how does it work?
Dask is an open-source library for parallel and distributed computing in Python. It allows users to harness the full power of their CPU and memory resources without the need for complex parallel algorithms or redundant copies of data. Dask is often used to process large amounts of data that don’t fit into memory, using the familiar DataFrame API provided by the dask.dataframe module.

Dask works by breaking up large computations into smaller chunks that can be executed in parallel. It then schedules and coordinates these smaller tasks across multiple threads or processors. Dask also uses memory-efficient data structures, such as blocked arrays, to handle large datasets that do not fit into memory. This allows it to perform complex computations on larger-than-memory datasets using familiar APIs from libraries like NumPy and Pandas.
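
As a rough sketch of this chunking idea (the data and the 8-partition count below are purely illustrative), you can turn an ordinary Pandas DataFrame into a Dask DataFrame and see that operations stay lazy until .compute() is called:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1_000_000)})  ## A regular in-memory Pandas DataFrame
ddf = dd.from_pandas(pdf, npartitions=8)     ## Split it into 8 partitions (chunks)

print(ddf.npartitions)       ## 8 -- each partition is itself a small Pandas DataFrame
lazy_sum = ddf["x"].sum()    ## Only builds a task graph; nothing runs yet
print(lazy_sum.compute())    ## Schedules and executes the chunks in parallel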

Pandas Vs Dask: Loading a 300+MB file
- Pandas:
%%time
import pandas as pd ## Importing module
data = pd.read_csv("a.csv") ## Reading the CSV file

Output time: Pandas to load a 300+ MB file
- Dask:
%%time
import dask.dataframe as dd ## Importing module
data = dd.read_csv("a.csv") ## Reading the CSV file

Output time: Dask to load a 300+ MB file
NOTE: Dask is almost 5x faster than Pandas at loading the file. Keep in mind that dd.read_csv is lazy: it mostly sets up the partitions up front and reads the actual data when a computation runs.
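If you want more control over how Dask splits the file, dd.read_csv accepts a blocksize argument (the file name and the 64 MB value below are only placeholders for illustration):

import dask.dataframe as dd

data = dd.read_csv("a.csv", blocksize="64MB")  ## Each ~64 MB block of the file becomes one partition
print(data.npartitions)                        ## Number of partitions Dask created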
Pandas Vs Dask: Group by operation
- Pandas:
%%time
data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'})

Output time: Pandas Group By operation
- Dask:
%%time
data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'}).compute()

Output time: Dask Group By operation
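Note the extra .compute() in the Dask version: the groupby/agg call only builds a task graph, and .compute() is what actually executes it and returns a regular Pandas object. A minimal sketch, reusing the columns a–e from the example above:

result = data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'})  ## Lazy: returns a Dask object
print(type(result))       ## dask.dataframe object -- no work has been done yet
final = result.compute()  ## Runs the aggregation in parallel across partitions
print(type(final))        ## pandas.core.frame.DataFrame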
Pandas Vs Dask: Write back to CSV
- Pandas:
%%time
data.to_csv('Pandas_csv.csv', index = False)

Output time: Pandas to_csv operation
- Dask:
%%time
data.to_csv('Dask_file_csv.csv', index = False) ## Saves multiple files (one per partition)

Output time: Dask to_csv operation (multiple part files)
With this method, the output is split into multiple part files, named like the pattern below:

File_name.part
To save all the data in a single file, use the single_file option:
%%time
data.to_csv('Dask-csv.csv', index = False, single_file = True) ## save in one file
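
If you prefer one file per partition but want to control the names, dd.to_csv also accepts a glob-style pattern with '*' (the 'Dask_part-*.csv' name below is just an example):

data.to_csv('Dask_part-*.csv', index = False)  ## '*' is replaced by the partition number: Dask_part-0.csv, Dask_part-1.csv, ...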

Conclusion

In some cases, such as loading and merging data, Dask is considerably faster than Pandas, but for in-memory aggregation and sorting, Pandas is hard to beat.
I am a Data Science enthusiast🌺, learning and exploring how Math, Business, and Technology can help us make better decisions in the field of data science.
- Connect with me on LinkedIn.
- Want to read more (Medium): https://medium.com/@ravikumar10593/
- Find my all handles: https://linktr.ee/ravikumar10593
If this article helped you, don’t forget to follow, like, and share it with your friends👍 Happy Learning!