Quit Pandas❌, Use Dask DataFrames Instead🤯
In this post, we are going to see how the Dask data frame can be used in place of the Pandas data frame.
Ravi kumar
In the last couple of posts, we have seen multiple ways to optimize our Pandas operations.
- One way is to use “Pandarallel”, which enables parallel processing on a Pandas data frame

- Another way we discussed is to use the “modin.pandas” library in place of the “pandas” library

In this post, we will compare Dask and Pandas data frames on a few operations so that it becomes clear how parallel processing helps boost performance.
Table of Contents:
- What is Dask and how does it work?
- Pandas Vs Dask: Loading a 300+MB file
- Pandas Vs Dask: Group by operation
- Pandas Vs Dask: Write back to CSV
- Conclusion
What is Dask and how does it work?
Dask is an open-source library for parallel and distributed computing in Python. It allows users to harness the full power of their CPU and memory resources without the need for complex parallel algorithms or redundant copies of data. Dask is often used to process large amounts of data that don’t fit into memory, using the familiar DataFrame API provided by the dask.dataframe module.

Dask works by breaking up large computations into smaller chunks that can be executed in parallel. It then schedules and coordinates these smaller tasks across multiple threads or processors. Dask also uses memory-efficient data structures, such as blocked arrays, to handle large datasets that do not fit into memory. This allows it to perform complex computations on larger-than-memory datasets using familiar APIs from libraries like NumPy and Pandas.
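
As a rough sketch of this chunking idea (the data and the 8-partition count below are purely illustrative), you can turn an ordinary Pandas DataFrame into a Dask DataFrame and see that operations stay lazy until .compute() is called:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1_000_000)})  ## A regular in-memory Pandas DataFrame
ddf = dd.from_pandas(pdf, npartitions=8)     ## Split it into 8 partitions (chunks)

print(ddf.npartitions)       ## 8 -- each partition is itself a small Pandas DataFrame
lazy_sum = ddf["x"].sum()    ## Only builds a task graph; nothing runs yet
print(lazy_sum.compute())    ## Schedules and executes the chunks in parallel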

Pandas Vs Dask: Loading a 300+MB file
- Pandas:
%%time
import pandas as pd ## Importing module
data = pd.read_csv("a.csv") ## Reading the CSV file

Output time: Pandas to load a 300+ MB file
- Dask:
%%time
import dask.dataframe as dd ## Importing module
data = dd.read_csv("a.csv") ## Reading the CSV file

Output time: Dask to load a 300+ MB file
NOTE: Dask is almost 5x faster than Pandas at loading the file. Keep in mind that dd.read_csv is lazy: it mostly sets up the partitions up front and reads the actual data when a computation runs.
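If you want more control over how Dask splits the file, dd.read_csv accepts a blocksize argument (the file name and the 64 MB value below are only placeholders for illustration):

import dask.dataframe as dd

data = dd.read_csv("a.csv", blocksize="64MB")  ## Each ~64 MB block of the file becomes one partition
print(data.npartitions)                        ## Number of partitions Dask created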
Pandas Vs Dask: Group by operation
- Pandas:
%%time
data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'})

Output time: Pandas Group By operation
- Dask:
%%time
data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'}).compute()

Output time: Dask Group By operation
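Note the extra .compute() in the Dask version: the groupby/agg call only builds a task graph, and .compute() is what actually executes it and returns a regular Pandas object. A minimal sketch, reusing the columns a–e from the example above:

result = data.groupby(['a', 'b', 'c', 'd']).agg({'e' : 'sum'})  ## Lazy: returns a Dask object
print(type(result))       ## dask.dataframe object -- no work has been done yet
final = result.compute()  ## Runs the aggregation in parallel across partitions
print(type(final))        ## pandas.core.frame.DataFrame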
Pandas Vs Dask: Write back to CSV
- Pandas:
%%time
data.to_csv('Pandas_csv.csv', index = False)

Output time: Pandas to_csv operation
- Dask:
%%time
data.to_csv('Dask_file_csv.csv', index = False) ## Saves multiple files (one per partition)

Output time: Dask to_csv operation (multiple part files)
With this method, the output is split into multiple part files, named like the pattern below:

File_name.part
To save all the data in a single file, use the single_file option:
%%time
data.to_csv('Dask-csv.csv', index = False, single_file = True) ## save in one file
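
If you prefer one file per partition but want to control the names, dd.to_csv also accepts a glob-style pattern with '*' (the 'Dask_part-*.csv' name below is just an example):

data.to_csv('Dask_part-*.csv', index = False)  ## '*' is replaced by the partition number: Dask_part-0.csv, Dask_part-1.csv, ...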

Conclusion

In some cases, such as loading and merging data, Dask is considerably faster than Pandas, but for in-memory aggregation and sorting, Pandas is hard to beat.
I am a Data Science enthusiast🌺, learning and exploring how Math, Business, and Technology can help us make better decisions in the field of data science.
- Connect with me on LinkedIn.
- Want to read more (Medium): https://medium.com/@ravikumar10593/
- Find my all handles: https://linktr.ee/ravikumar10593
If this article helped you, don’t forget to follow, like, and share it with your friends👍 Happy Learning!