5 lesser-known Python libraries to improve your Data Science workflow
Hint: No Pandas, NumPy, or scikit-learn in this list
“A star does not compete with other stars around it; it just shines.”
― Matshona Dhliwayo
Python is by far the most popular programming language in the field of Data Science. Its rich ecosystem of libraries, simple syntax, and high productivity make it an extremely popular language among beginners as well as seasoned practitioners.
Therefore, it is not unusual to find countless articles praising the power of Python and its famous data science libraries like NumPy, Pandas, TensorFlow, Matplotlib, etc.
This blog instead turns the spotlight on some lesser-known Python libraries that are slowly gaining recognition in the Data Science community.
1. Streamlit
Streamlit has been gaining tremendous popularity in recent times. It launched only two years ago, in 2018, and its website already bills it as “the fastest way to create data apps”.
By embracing Python scripting, users can create data apps within minutes. Additionally, UI components like sliders, buttons, widgets, and text boxes can be added with just a single line of code.
The result can be seen here.
Furthermore, the library is compatible with many other major libraries and frameworks, such as scikit-learn, Keras, OpenCV, TensorFlow, PyTorch, NumPy, and Matplotlib.
Founded by former Google executives Amanda Kelly, Thiago Teixeira, and Adrien Treuille, the company recently announced a Series A funding round in which it raised $21 million. It’s safe to say that the future looks very promising for Streamlit.
Stars ⭐️ ⭐️ ⭐️ — 10.2k
Forks 🍴 🍴 🍴 — 921
2. tqdm
According to its documentation:
tqdm means "progress" in Arabic (taqadum, تقدّم) and is an abbreviation for "I love you so much" in Spanish (te quiero demasiado).
As you might have guessed, tqdm is a library for adding smart progress meters to your iterative processes. All you have to do is wrap an iterable with tqdm(), and you’re all set.
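Here is a small sketch of that wrapping pattern. The loop body, the sleep call standing in for real work, and the "Processing" label are illustrative assumptions.

```python
import time

from tqdm import tqdm

total = 0

# Wrapping the iterable in tqdm() is the only change needed; the progress
# bar updates automatically as the loop advances.
for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.001)  # stand-in for real work
    total += i

print(total)
```

The optional desc argument simply labels the bar, which helps when several loops run one after another.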
Check out the demo here.
tqdm has become extremely popular among Data Scientists. It is often used with the PyTorch framework to track progress across the training epochs of neural networks. The next time you are building a neural network, do not forget to use this handy library! 😊
Stars ⭐️ ⭐️ ⭐️ — 15.6k
Forks 🍴 🍴 🍴 — 810
3. pandas-profiling
This is another library that has recently gained a lot of recognition in the Kaggle community. pandas-profiling generates HTML profile reports for Pandas data frames. The GitHub docs mention that the purpose of the library is to provide an upgrade to the plain df.describe() of the Pandas library.
Here are some of the statistics the library reports, where relevant for the column type:
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables, plus Spearman, Pearson, and Kendall matrices
- Missing values: matrix, count, heatmap, and dendrogram of missing values
- Text analysis: learns about categories (Uppercase, Space), scripts (Latin, Cyrillic), and blocks (ASCII) of text data
- File and Image analysis: extracts file sizes, creation dates, and dimensions, and scans for truncated images or those containing EXIF information
A simple one-liner like the one above can produce a report like the one here.
One caveat: on very large datasets, building the profile can take a very long time, and in some cases it may even hang.
pandas-profiling is a fantastic library to add to your EDA workflow.
Stars ⭐️ ⭐️ ⭐️ — 5.7k
Forks 🍴 🍴 🍴 — 857
4. PyCaret
PyCaret is rapidly gaining popularity because of its low-code approach to machine learning. The library acts as a wrapper around several popular classical machine learning libraries.
Because of this, PyCaret can execute complicated tasks such as inter-model performance comparison with a single line of code. The official docs note that one of the key objectives of creating the library was to reduce costs for startups that want to leverage machine learning.
It is super easy to use and is deployment ready. The documentation also notes that PyCaret pipelines can be saved as binary files, which makes them transferable across machines and environments.
With just two lines of code (an import and a call to setup()), you get a detailed view of the features and the outcomes of the preprocessing techniques, as in the following table:
Similarly, to compare multiple models you just need to write a single line of code:
As you can see, PyCaret gives users a great deal of power to build machine learning workflows with very little code. Use PyCaret to develop machine learning models at lightning speed! ⚡️⚡️⚡️
Stars ⭐️ ⭐️ ⭐️ — 2.1k
Forks 🍴 🍴 🍴 — 419
5. cudf and cuml by RAPIDS
Okay, so the last one is actually two libraries. However, the RAPIDS project is the new talk of the town among machine learning libraries. The reason is that it offers the niche ability to use GPUs to train classical machine learning models.
Traditionally, GPUs are used to speed up Deep Learning models. However, with the introduction of the RAPIDS project, the powerful computational capabilities of GPUs can now be extended to classical models.
cudf and cuml use the CUDA programming model at the lower level to provide accelerated workflows when working with data frames and machine learning models.
cudf offers an API that looks very similar to Pandas. Here is some sample code:
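The following is a minimal sketch rather than the original sample. It assumes cudf is installed on a machine with an NVIDIA GPU and falls back to pandas otherwise, which also shows how closely the two APIs line up. The xdf alias and the column names are illustrative.

```python
# Assumption: use cudf when it is installed (requires an NVIDIA GPU);
# otherwise fall back to pandas so the snippet still runs.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# DataFrame construction looks the same in both libraries.
df = xdf.DataFrame(
    {"team": ["a", "b", "a", "b"], "points": [1.0, 2.0, 3.0, 4.0]}
)

# Column arithmetic, reductions, and groupby use the familiar pandas syntax.
df["double_points"] = df["points"] * 2
mean_points = float(df["points"].mean())
points_by_team = df.groupby("team")["points"].sum()

print(mean_points)
print(points_by_team)
```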
Likewise, the API offered by cuml is very similar to scikit-learn:
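A minimal sketch of that similarity is shown below. It assumes cuml is installed on a GPU machine and falls back to scikit-learn otherwise, so the same fit and predict calls run either way. The toy data (y = 2x + 1) is made up for illustration.

```python
import numpy as np

# Assumption: prefer the GPU implementation when cuml is available;
# otherwise use the scikit-learn estimator with the same interface.
try:
    from cuml.linear_model import LinearRegression
except ImportError:
    from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1.
X = np.arange(10, dtype=np.float32).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

# The estimator follows the usual fit / predict pattern.
model = LinearRegression()
model.fit(X, y)

predictions = model.predict(np.array([[10.0]], dtype=np.float32))
prediction = float(predictions[0])
print(prediction)
```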
According to the RAPIDS documentation, the cuml implementations can speed up training by 10–50x compared to their CPU equivalents when training on large datasets. RAPIDS has many more features available. Don’t forget to check it out and give it a spin the next time you are working on a Machine Learning project!
cudf Github Stats:
Stars ⭐️ ⭐️ ⭐️ — 3.1k
Forks 🍴 🍴 🍴 — 419
cuml Github Stats:
Stars ⭐️ ⭐️ ⭐️ — 1.6k
Forks 🍴 🍴 🍴 — 256
Hi, I work for a startup based out of Bangalore. I'm extremely passionate about Data Science. Feel free to reach out to me if you want to talk about data, movies/shows, music, or football (soccer for Americans).