There are many tools right now that we could be using for our Data Science projects, but how many of them do you actually know inside and out?

We can use those tools only because developers contribute to open source projects like the ones I am going to show you today.

These are popular projects that you might not even have known are open source. My goal is to get you interested in them, and maybe even to contribute!

Julia

This amazing project has over 1,000 contributors!

Julia is a high-level, high-performance dynamic language for technical computing. It is used in many areas of computer science, such as data visualization, data science, machine learning, and parallel computing.

Julia has been downloaded over 17 million times, and the community has registered over 4,000 Julia packages.

These include various mathematical libraries, data manipulation tools, and packages for general-purpose computing. In addition to these, you can easily use libraries from Python, R, C/Fortran, C++, and Java.

Scikit-learn

There are almost 2,000 contributors to this project!

Scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed.

If you want to learn about the contributing process, the project has a page dedicated to it. The maintainers are really open to contributions and welcome new contributors every single day!

There are many ways to contribute to scikit-learn, with the most common ones being the contribution of code or documentation to the project. Improving the documentation is no less important than improving the library itself.

If you find a typo in the documentation or have made improvements, do not hesitate to send an email to the mailing list or preferably submit a GitHub pull request.

Another way to contribute is to report issues you’re facing, and give a “thumbs up” on issues that others reported and that are relevant to you.

Within machine learning, the module covers many tasks, such as classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

It is also accessible to everybody and reusable in various contexts.
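
To make one of those tasks concrete, here is a minimal classification sketch. It assumes scikit-learn is installed; the dataset (iris), the model (logistic regression), and the split settings are my own illustrative choices, not something the project prescribes.

```python
# Requires the third-party package: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data so the score reflects unseen samples.
# random_state and stratify are illustrative choices for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier; max_iter is raised so the solver can converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the predictions on the held-out samples.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

The same fit and predict pattern applies to most scikit-learn estimators, which is a large part of why the library is reusable in so many contexts.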

Apache Mahout

Contributing to an Apache project is about more than just writing code — it’s about doing what you can to make the project better. There are lots of ways to contribute!

Apache Mahout(TM) is a distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms.

Apache Spark is the recommended out-of-the-box distributed backend, and Mahout can be extended to other distributed backends.

Mathematically Expressive Scala DSL
Support for Multiple Distributed Backends (including Apache Spark)
Modular Native Solvers for CPU/GPU/CUDA Acceleration

Ways to contribute

There are a ton of things in Mahout that the project would love to have contributions to: documentation, performance improvements, better tests, and so on.

The best place to start is the project's issue tracker: see what bugs have been reported and whether any look like something you could take on. Small, well-written, well-tested patches are a great way to get your feet wet. It could be something as simple as fixing a typo.

The more important piece is showing that you understand the necessary steps for making changes to the code. Mahout is a pretty big beast at this point, so changes, especially from non-committers, need to be evolutionary, not revolutionary, since it is often very difficult to evaluate the merits of a very large patch.

Natural Language Toolkit

NLTK — the Natural Language Toolkit — is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.

The NLTK organization on GitHub has many repositories, which makes issues and development easier to manage. The most important are:

nltk/nltk, the main repository with code related to the library;
nltk/nltk_data, repository with data related to corpora, taggers, and other useful data that are not shipped by default with the library, which can be downloaded by nltk.downloader;
nltk/nltk.github.com, NLTK website with information about the library, documentation, link for downloading NLTK Book, etc.;
nltk/nltk_book, source code for the NLTK Book.
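
The split between nltk/nltk and nltk/nltk_data shows up in everyday use. The sketch below assumes NLTK is installed; the sample sentence and the "punkt" package id are illustrative choices of mine. The tokenizer used here is regex-based and ships with the library, so it needs no download, while resources from nltk_data are fetched separately.

```python
# Requires the third-party package: pip install nltk
from nltk.tokenize import wordpunct_tokenize

# wordpunct_tokenize splits on a regular expression, so it works with
# the code from nltk/nltk alone, without any nltk_data download.
tokens = wordpunct_tokenize("NLTK supports research in NLP.")
print(tokens)

# Resources that live in nltk/nltk_data are fetched on demand.
# This needs network access, so it is left commented out here:
#   import nltk
#   nltk.download("punkt")  # "punkt" is one example package id
```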

H2O.ai

H2O.ai describes itself as the open source leader in AI and machine learning, with a mission to democratize AI for everyone. Its enterprise-ready platforms are used by hundreds of thousands of data scientists in over 20,000 organizations globally.

The company aims to help organizations in financial services, insurance, healthcare, telecommunications, retail, pharmaceuticals, and marketing become AI companies, deliver real value, and transform their businesses.

Ways to contribute

H2O has been built by a great number of contributors over the years, both within H2O.ai (the company) and in the wider open source community. You can begin to contribute to H2O by answering Stack Overflow questions or filing bug reports.