Hands-On Climate Time Series Classification with Deep Learning, using Python
Here’s how to build a Deep Neural Network for Time Series Classification in a few lines of code
Piero Paialunga
Time series are a huge part of our lives. Basically everything can be modelled as a quantity (on the y axis) that varies as time increases (on the x axis).
On the other hand, classification is an important application of Machine Learning. In fact, many of our goals can easily be framed as classification tasks.
Combining these two things, we get time series classification. Our goal is easy to state:
We want a model that, given a time series (i.e. a quantity that varies as time increases), is able to output a class.
There are tons of articles that explain the theory behind this goal (in my opinion, Marco Del Pra does a very good job in this article). This article will not focus on the theory, but will give a practical guide on how to classify real-world time series, building your classifier from scratch using Python.
In our specific case, we want to distinguish the continent of a country (Europe, Asia or Africa) given the time series of its temperature.
So let’s get started!
Please Note: This article is meant to guide you, step by step, to the solution. If you are not interested in the pre-processing part, feel free to skip directly to the Machine Learning Model section or the Results one.
0. The Libraries
The first thing we want to do is to call the help of some friends :)
Here is what we’ll need to get the work done:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import pycountry_convert as pc
from tensorflow import keras
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif as mi
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
plt.style.use('ggplot')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['font.size'] = 14
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['figure.titlesize'] = 12
plt.rcParams['image.cmap'] = 'jet'
plt.rcParams['image.interpolation'] = 'none'
plt.rcParams['figure.figsize'] = (12, 10)
plt.rcParams['axes.grid']=False
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.markersize'] = 8
colors = ['xkcd:pale orange', 'xkcd:sea blue', 'xkcd:pale red', 'xkcd:sage green', 'xkcd:terra cotta', 'xkcd:dull purple', 'xkcd:teal', 'xkcd:goldenrod', 'xkcd:cadet blue',
          'xkcd:scarlet']
1. The Dataset
For our experiment, I used a dataset that I know very well and that fits our goal: a collection of Earth surface temperature time series, recorded country by country.
Here is an example, for the United Kingdom:
Time series of the United Kingdom. [Image made by author, using Python]
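If you want to reproduce a plot like this one, here is a minimal sketch (it assumes the GlobalLandTemperaturesByCountry.csv file that we will load in the next section, with its dt, Country and AverageTemperature columns):
# Minimal sketch: plot the temperature time series of a single country
uk = pd.read_csv('GlobalLandTemperaturesByCountry.csv')
uk = uk[uk['Country'] == 'United Kingdom']
plt.plot(pd.to_datetime(uk['dt']), uk['AverageTemperature'], color='navy')
plt.xlabel('Date')
plt.ylabel('Average Temperature')
plt.title('United Kingdom')
plt.show()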
2. Data Pre-Processing
Now that you have downloaded the .csv, it’s time to read it with pandas.
data = pd.read_csv('GlobalLandTemperaturesByCountry.csv')
data.head()
By looking at the first rows, there are two things we need to take care of:
A. We don’t have the “Continent” information yet, which is the column we want to use as our “target”
B. We have to deal with some NaN values in the AverageTemperature column, which is the one we’ll use as the y axis of the time series
Let’s solve these problems, one step at a time.
The first thing we want to use is a magical library that we imported, namely pycountry_convert.
With the following lines of code we are able to get the continent code out of a country name:
country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
continent_name = pc.country_alpha2_to_continent_code(country_code)
But, sadly, this does not work for the entire dataset, as we can see by applying it to our data:
countries = data.Country.drop_duplicates().to_list()
unknown_continent_countries = []
continent_countries = []
known_continent_countries = []
for c in countries:
    try:
        country_code = pc.country_name_to_country_alpha2(c, cn_name_format="default")
        continent_name = pc.country_alpha2_to_continent_code(country_code)
        continent_countries.append(continent_name)
        known_continent_countries.append(c)
    except:
        print('The country named %s has no continent' % (c))
        unknown_continent_countries.append(c)
print('The continents that appear in the dataset are', list(set(continent_countries)))
A lot of countries are not recognized by pycountry_convert, so I added their continents manually (thank me later ;) ):
to_add_continents = ['EU','AF','AN','NA','AS','OC','SA','EU','AS','AF','AF','EU','EU','SA','OC','EU','EU','AS','AF','AN',
                     'EU','OC','EU','NA','OC','AS','OC','AF','NA','NA','NA','AF','NA','SA','OC','EU','AS','SA','NA','EU',
                     'NA','AF']
if len(to_add_continents) == len(unknown_continent_countries):
    print('All the countries have a related continent')
As we can see, now all countries have a related continent, which solves point A!
Some countries are in Antarctica, but they are so few that, at this stage, it is not worth adding them to our dataset (by the way, it is extremely easy to determine whether a country is in Antarctica or not, isn’t it?).
So we’ll eventually need to cut these countries out of our dataset. Moreover, we still have that NaN problem to solve. We will do it with the magical fillna() function!
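One thing the snippets above leave implicit is how these continent codes get attached to the dataframe: the code below relies on a Continent column, so we need to build it first. Here is a minimal sketch of how that mapping might look (it assumes that to_add_continents lists the continents of unknown_continent_countries in the same order):
# Hypothetical sketch: build the country-to-continent mapping and attach a 'Continent' column
# Assumption: to_add_continents[i] is the continent of unknown_continent_countries[i]
country_to_continent = dict(zip(known_continent_countries, continent_countries))
country_to_continent.update(dict(zip(unknown_continent_countries, to_add_continents)))
data['Continent'] = data['Country'].map(country_to_continent)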
Here’s the code:
data.AverageTemperature = data.AverageTemperature.fillna(method='ffill')
data = data.drop(columns='AverageTemperatureUncertainty')
data = data[data.Continent!='AN']
data.tail()
And that solves problem B!
Unfortunately, we are not done yet. We need to make sure that each country has the same number of entries in its time series. In our dataset, some countries start in 1743, others in 1948. We have to standardize the dataset so that every country has the same number of entries; in particular, we will keep, for every country, only the most recent entries, so that all series start from the most recent starting year in the dataset.
This is the number of entries before our standardization:
number_of_entries = []
countries = data.Country.drop_duplicates().to_list()
for c in countries:
    number_of_entries.append(len(data[data['Country']==c]))
sns.kdeplot(number_of_entries, fill=True)
plt.xlabel('Number of entries')
This is the standardization code and the standardized dataset:
number_of_entries = np.array(number_of_entries)
min_number = number_of_entries.min()
red_data = data[data['Country']==countries[0]]
red_data = red_data[len(red_data)-min_number::]
for i in range(1, len(countries)):
    data_i = data[data['Country']==countries[i]]
    data_i = data_i[len(data_i)-min_number::]
    red_data = pd.concat([red_data, data_i])  # DataFrame.append was removed in recent pandas versions
red_data.head()
And this is the histogram of the number of entries after the standardization:
number_of_entries = []
for c in countries:
    number_of_entries.append(len(red_data[red_data['Country']==c]))
plt.hist(number_of_entries)
plt.xlabel('Number of entries')
OK, let’s take a look at our classes:
sns.countplot(x=red_data.Continent, palette='plasma')
And let’s consider the three most populated classes: Europe, Asia and Africa. Moreover, let’s collect all the labels of our dataset:
red_data = red_data[(red_data['Continent']=='EU')|(red_data['Continent']=='AS')|(red_data['Continent']=='AF')]
red_data_label = red_data.reset_index().drop('index', axis=1)
tot = int(len(red_data_label)/min_number)
labels = []
for i in range(0, tot):
    labels.append(red_data_label.loc[i*min_number:(i+1)*min_number].Continent.to_list()[0])
Let’s also define some helper functions that we will use to plot the data:
def pick_city(data):
    # Pick a random country from the dataset
    cities = data.Country.drop_duplicates().to_list()
    city = np.random.choice(cities)
    return data[data['Country']==city]

def plot_data(data, color='navy'):
    data = data.reset_index().drop(columns='index')
    city = data[0:1].Country.to_list()[0]
    index_to_plot = np.linspace(0, len(data)-1, 5).astype(int)
    indexes = data.index.to_list()
    y = data.AverageTemperature
    sns.lineplot(x=indexes, y=y, label=city, color=color)
    plt.xticks(index_to_plot, data.dt.loc[index_to_plot])
    plt.legend(fontsize=20)
    plt.xlabel('Date', fontsize=30)
    plt.ylabel('Temperature', fontsize=30)
As we can see by plotting one random country from each of the three continents, our task is not easy at all!
continents = red_data.Continent.drop_duplicates().to_list()
data_list = []
for continent in continents:
    data_list.append(red_data[red_data.Continent==continent])
i = 1
colors = ['navy','firebrick','darkorange','k','gold','darkgrey','purple']
plt.figure(figsize=(30,30))
for cont_data in data_list:
    plt.subplot(3, 3, i)
    data_i = pick_city(cont_data)
    continent = data_i.Continent.to_list()[0]
    plt.title('Continent = %s' % (continent), fontsize=30)
    plot_data(data_i, color=colors[i-1])
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=30)
    i = i + 1
plt.tight_layout()
Finally, let’s prepare our dataset for the Machine Learning model:
labels = np.array(labels)
le = LabelEncoder()
le = le.fit(labels)
labels = np.array(le.transform(labels))
red_data = red_data.reset_index().drop(columns='index')
len_data = len(red_data)
len_train_country = int((len_data/min_number)*0.8)
len_train = int(len_train_country*min_number)
red_train_data = red_data[0:len_train]
red_test_data = red_data[len_train:len_data]
len_test_country = int((len(red_test_data)/min_number))
x = np.array(red_data.AverageTemperature)
y = labels
# One row per country: each row is a full temperature time series of length min_number
x = x.reshape(int(len(x)/min_number), min_number)
# Shuffle the countries with a permutation, so that no country appears twice
idx_random = np.random.permutation(len(y))
x_train, y_train = x[idx_random][0:len_train_country], y[idx_random][0:len_train_country]
x_test, y_test = x[idx_random][len_train_country:len(y)], y[idx_random][len_train_country:len(y)]
x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], 1))
num_classes = 3
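Before moving on, it is worth sanity-checking the shapes and the class balance of the split (a quick check; the exact numbers depend on your download):
# Quick sanity check of the train/test split
print('Train:', x_train.shape, y_train.shape)
print('Test:', x_test.shape, y_test.shape)
print('Train class counts:', dict(zip(*np.unique(y_train, return_counts=True))))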
3. The Machine Learning Model
The model is a 1D Convolutional Neural Network. Its structure is very easy to understand and computationally cheap. Moreover, it follows the architecture of the Keras “Timeseries classification from scratch” example.
Here is the model:
def make_model(input_shape):
    input_layer = keras.layers.Input(input_shape)
    conv1 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(input_layer)
    conv1 = keras.layers.BatchNormalization()(conv1)
    conv1 = keras.layers.ReLU()(conv1)
    conv2 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv1)
    conv2 = keras.layers.BatchNormalization()(conv2)
    conv2 = keras.layers.ReLU()(conv2)
    conv3 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv2)
    conv3 = keras.layers.BatchNormalization()(conv3)
    conv3 = keras.layers.ReLU()(conv3)
    gap = keras.layers.GlobalAveragePooling1D()(conv3)
    output_layer = keras.layers.Dense(num_classes, activation="softmax")(gap)
    return keras.models.Model(inputs=input_layer, outputs=output_layer)
model = make_model(input_shape=x_train.shape[1:])
And its summary:
model.summary()
Let’s train the model:
epochs = 500
batch_size = 5
callbacks = [
    keras.callbacks.ModelCheckpoint(
        "best_model.h5", save_best_only=True, monitor="loss"
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor="loss", factor=0.5, patience=20, min_lr=0.0001
    ),
    keras.callbacks.EarlyStopping(monitor="loss", patience=50, verbose=1),
]
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
history = model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=callbacks,
    verbose=1,
)
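If you want to keep an eye on how training went, here is a quick sketch that plots the training curves from the history object returned by model.fit (the metric name matches the one passed to model.compile above):
# Optional: plot the training loss and accuracy curves
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['sparse_categorical_accuracy'], label='accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()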
4. The Results
An accuracy of 79% has been obtained on the test set:
pred = model.predict(x_test)
res = []
for p in pred:
    res.append(p.argmax())
print('The accuracy score on the test set is %.2f' % accuracy_score(y_test, res))
Here are the other metrics:
print(classification_report(y_test, res))
And here is the confusion matrix:
res = le.inverse_transform(res)
y_test = le.inverse_transform(y_test)
results = pd.DataFrame({'Predict': res, 'Target': y_test})
results.head()
sns.heatmap(confusion_matrix(results['Target'], results['Predict']), annot=True,
            xticklabels=['Africa','Asia','Europe'], yticklabels=['Africa','Asia','Europe'], cmap='plasma')
plt.ylabel('Target')
plt.xlabel('Predicted')
Final Considerations
While bigger and bigger models are often built to solve incredibly complex problems, sometimes a considerably smaller Neural Network can achieve acceptable results with limited computational resources.
I think this consideration applies to the dataset we analyzed together, and I really hope you had as much fun as I did!
If you have any questions or comments, I’d be glad to hear them!