If you are an AI/ML/DS researcher or practitioner, you already know that every interesting project is full of challenges, and that you usually make numerous mistakes before discovering the best way of solving the problem. This applies not only to coding or modeling choices but also to how you approach a particular problem and decide on the most effective and realistic road map. In this series of posts, I will present and discuss the highlights of my experience in these areas.

I will start the first post with code snippets for performing various tasks in the best way I know (I will appreciate your comments, so that we end up with the most elegant solutions in the repository). I hope this will turn into a useful resource for a better understanding of concepts and for more elegant and consistent AI/ML/DS code.

These posts will cover three main areas: 1) hard-to-grasp concepts that took me a while to fully understand, 2) programming constructs for the most elegant and expressive ways of performing a certain task, and 3) the most efficient implementation tools and techniques an AI practitioner should know about.

To keep things simple, the first post discusses some basic programming and data wrangling topics. More advanced AI topics will follow in the upcoming articles. I will try to keep each post and each topic as concise as possible so that readers do not get lost in irrelevant details.

1. Functional Programming for Clean, Expressive and Compact Code

Use itertools: itertools is a great library. However, many data science practitioners are not aware of it, or at least are not quite comfortable using it. Here, I present two simple examples and will introduce more in later posts. The first one uses itertools to avoid nested loops. The second one generates a Cartesian product of several lists to populate a data frame from a dictionary.

import itertools as it
import pandas as pd

# You can use this instead of a nested loop
for i, j, k in it.product(range(4), range(5), range(3, 7)):
    print(i, j, k)

# Generate a Cartesian product of several lists to populate a data frame from a source dictionary.
# You can use the same construct for generating both data and multi-index.
index_dict = {'First': ['A', 'B', 'C', 'D'], 'Second': ['I', 'II'], 'Third': ['1', '2', '3']}
grid_index = pd.MultiIndex.from_tuples(it.product(*index_dict.values()), names=index_dict.keys())

source_dict = {'Name': ['John', 'Jack', 'Adam', 'Mehdi', 'David', 'Ariel'], 'Age': [20, 30, 40, 50]}
grid_dataframe = pd.DataFrame.from_records(it.product(*source_dict.values()), columns=source_dict.keys(), index=grid_index)
# grid_dataframe.set_index(grid_index, inplace=True)

print(grid_dataframe)
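
To see why a single product call can stand in for a nested loop, the short sketch below compares the two directly. It is not part of the original snippet, and the variable names nested and flat are illustrative. The rightmost iterable varies fastest, just like the innermost loop.

```python
import itertools as it

# Tuples produced by an explicit nested loop.
nested = []
for i in range(4):
    for j in range(5):
        for k in range(3, 7):
            nested.append((i, j, k))

# The same tuples from a single call; the rightmost range varies fastest.
flat = list(it.product(range(4), range(5), range(3, 7)))

print(flat == nested)  # True
print(len(flat))       # 80 = 4 * 5 * 4
```

Note that it.product returns a lazy iterator, so it is wrapped in list() here only to allow the comparison.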

Comprehensions with ternary operator: Even the most basic functional programming ideas and constructs can help you write cleaner, more readable and more compact code. Comprehensions (set, list, dictionary) are one of these constructs. In the following example, we first filter a list of integers (0, 1, ..., 14) by removing all the numbers divisible by 3, and then multiply the even numbers by two and negate the remaining ones. The first solution uses the filter and map functions, but the solution using a list comprehension with a filter condition and a ternary operator is cleaner and more readable.

# Conversion using 'map' and 'filter' functions.
list1 = list(map(lambda x: x*2 if x%2==0 else -x, filter(lambda x: x%3!=0, range(15))))

# A more expressive way of achieving the same results.
list2 = [x*2 if x%2==0 else -x for x in range(15) if x%3!=0]

print(list1)
print(list2)
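
The same filter-plus-ternary pattern carries over to the set and dictionary comprehensions mentioned above. The following is a minimal sketch on the same range of integers; the names remainders and transformed are illustrative, not from the original code.

```python
numbers = range(15)

# Set comprehension: distinct remainders modulo 4 of the numbers not divisible by 3.
remainders = {x % 4 for x in numbers if x % 3 != 0}

# Dictionary comprehension: map each kept number to its transformed value
# (even numbers doubled, odd numbers negated).
transformed = {x: (x * 2 if x % 2 == 0 else -x) for x in numbers if x % 3 != 0}

print(remainders)
print(transformed)
```

The values of transformed, read in insertion order, match the list produced by the list comprehension above.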

Word to index using functional constructs: Word embedding is an essential part of most deep-learning-based NLP applications. For this, we usually use the Embedding layers provided by the various deep learning frameworks. To feed such a layer, we need to construct a dictionary that assigns a contiguous integer index to every word in our vocabulary.

import re

# Create list of words
document = 'During my few years as an ML/DS researcher and practitioner, there has been lots of ups and down, making many mistakes and finally finding out what would be the best way of doing a certain task. This is not only limited to coding or modeling choices but also how to approach a particular problem or deciding the most useful and realistic road map. In this story, I will try to summarize the highlights in both technical and roadmap. I am going to to start with code snippets of performing very common tasks in the best possible way I know and I will appreciate your comments to make sure we have the best possible one. I hope this will turn into a useful repository for better understanding of concepts and for elegant data science code. In particular, many of us are used to applying API from various libraries without really thinking how things done under the hood. I will start this journey by some small examples an see where the road will lead us.'
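# Use a raw string for the pattern and drop the empty string left by the trailing period.
list_of_words = [word for word in re.split(r'[.,]+\s*|\s+', document) if word]

# Create 'word to index' and 'index to word' dictionaries.
# Sorting the vocabulary makes the indices reproducible (set order is not guaranteed).
vocabulary = sorted(set(list_of_words))
word_index = {word: index for index, word in enumerate(vocabulary)}
index_word = dict(enumerate(vocabulary))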

print(word_index)
print(index_word)
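
Once the two dictionaries exist, they can be used to turn text into the integer sequences an embedding layer expects. The sketch below is one possible way to do this, under assumptions that are not in the original: a toy corpus, two reserved tokens ('<pad>' and '<unk>'), and the hypothetical helper functions encode and decode.

```python
import re

# Hypothetical toy corpus, chosen only for illustration.
corpus = 'the cat sat on the mat. the dog sat too.'
words = [w for w in re.split(r'[.,]+\s*|\s+', corpus) if w]

# Assumption: reserve index 0 for padding and 1 for unknown words,
# then number the sorted vocabulary after them.
specials = ['<pad>', '<unk>']
vocabulary = specials + sorted(set(words))
word_index = {word: index for index, word in enumerate(vocabulary)}
index_word = dict(enumerate(vocabulary))


def encode(sentence):
    """Map each word to its index, falling back to the '<unk>' index."""
    return [word_index.get(w, word_index['<unk>']) for w in sentence.split()]


def decode(indices):
    """Map indices back to words."""
    return ' '.join(index_word[i] for i in indices)


ids = encode('the bird sat')
print(ids)          # [7, 1, 6]
print(decode(ids))  # the <unk> sat
```

Reserving explicit indices for padding and unknown words is a common convention, but the exact tokens and positions depend on the framework and model you use.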

2. Data Wrangling

Pandas accessors for efficient and readable code: Pandas provides three useful accessors for three data types: datetime (.dt), string (.str) and categorical (.cat). Their methods operate on a whole column in a single call, which keeps the code short and readable, and the .dt and .cat methods in particular are vectorized and efficient. Before writing a loop or a custom function, check whether the functionality you need is already available through one of these accessors.

import pandas as pd

# freq='M' generates month-end dates; pandas 2.2 and later deprecate this alias in favor of 'ME'.
df = pd.DataFrame({'Date': pd.date_range('2020-04-08', periods=5, freq='M'),
                   'Name': ['Mehdi-SD', 'Saeed-SD', 'Elshan-YP', 'Ariel-SB', 'Parisa-MH'],
                   'Score': pd.Categorical(['A', 'B', 'A', 'C', 'B'], categories=['A', 'B', 'C', 'D', 'E', 'F'], ordered=True)})

# .dt accessor examples
print(f"\n =========== .dt Accessor ========= \
\n Date: {df['Date'].dt.date.tolist()}, \
\n Month Name: {df['Date'].dt.month_name().tolist()}, \
\n Day Name: {df['Date'].dt.day_name().tolist()}, \
\n Days in month: {df['Date'].dt.days_in_month.tolist()}, \
\n Converted to Python datetime: {df['Date'].dt.to_pydatetime().tolist()}, \
\n Convert to monthly periods: {df['Date'].dt.to_period('M').tolist()}")

# .str accessor
print(f"\n =========== .str Accessor ========= \
\n Upper-cased: {df['Name'].str.upper().tolist()}, \
\n Last name extracted: {df['Name'].str.extract(r'-([A-Z]{2})')[0].tolist()}, \
\n Converted to comma-separated: {df['Name'].str.replace('-', ', ').tolist()}, \
\n Number of vowels: {df['Name'].str.count(r'[aeiouAEIOU]').tolist()}, \
\n Left justified: {df['Name'].str.ljust(width=20).tolist()}, \
\n Leading zeros added: {df['Name'].str.zfill(width=20).tolist()}, \
\n Trailing zeros added: {df['Name'].str.ljust(width=20, fillchar='0').tolist()}")

# .cat accessor
print(f"\n =========== .cat Accessor ========= \
\n Codes: {df['Score'].cat.codes.tolist()}, \
\n Categories: {df['Score'].cat.categories.tolist()}, \
\n Are the categories ordered: {df['Score'].cat.ordered}, \
\n Codes after reversing the category order: {df['Score'].cat.reorder_categories(['F', 'E', 'D', 'C', 'B', 'A']).cat.codes.tolist()}, \
\n Categories after adding 'A+': {df['Score'].cat.add_categories(['A+']).cat.categories.tolist()}, \
\n Categories after removing unused ones: {df['Score'].cat.remove_unused_categories().cat.categories.tolist()}")
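
As a small illustration of why the accessors are worth checking first, the sketch below contrasts a .str call with the equivalent row-by-row apply, and shows that an ordered categorical column can be compared against a category label. The shortened data frame and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mehdi-SD', 'Saeed-SD', 'Elshan-YP'],
    'Score': pd.Categorical(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=True),
})

# Accessor version: one call on the whole column.
upper_accessor = df['Name'].str.upper()

# Row-by-row version that yields the same values with more code.
upper_apply = df['Name'].apply(lambda name: name.upper())

print(upper_accessor.tolist() == upper_apply.tolist())  # True

# Ordered categoricals support comparisons against a category label.
at_most_b = df['Score'] <= 'B'
print(at_most_b.tolist())  # [True, True, False]
```

The comparison only works because the column was created with ordered=True; on an unordered categorical, pandas raises a TypeError for inequality comparisons.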

In the upcoming posts, I will choose topics from the list below (this is my initial list and is therefore subject to substantial change):

Nonlinear Dimensionality Reduction
Class Polymorphism
LSTM and GRU Simplified Explanation
ggplot Group Variable demystified, ggplot vs. seaborn
Ratios and Empirical Bayes
More Functional Programming Techniques
Tidy Data in R and Python
Useful Techniques for Data Wrangling in R (tidyverse) and Python (Pandas and Numpy)
Causal Inference meets Interpretable Machine Learning

You can help me improve these posts in at least two ways: 1) Proposing your topics of interest, and 2) Suggesting more elegant ways of doing a certain task than the one I presented in these posts. Hopefully, we will get a collection of posts and repository beneficial to all AI practitioners.