
Fake News Detection with Machine Learning, using Python

A step-by-step fake news detection using BERT, TensorFlow and PyCaret.

Piero Paialunga · May 6 · 5 min read



One of the most challenging areas of Machine Learning is the one that deals with language, known as Natural Language Processing (NLP).

It is true that every area of Machine Learning can be complex and challenging at some level, but NLP is particularly difficult because it requires exploring human communication and, in some sense, human consciousness.

Moreover, while it is relatively easy to encode an image as data (i.e. a two-dimensional matrix), or a physics experiment (basically a .csv file), it is much harder to encode a text as a number or a vector.

But what do we actually want to solve? What are these difficult tasks I’m talking about? Well, in this blog I will discuss an example of text classification. In particular, we want to classify whether or not a piece of news is fake.

Moreover, we want to tackle this task using state-of-the-art methods: BERT and a special encoder released by Google known as the Universal Sentence Encoder. Plus, we will use a traditional Machine Learning tool that is becoming more and more popular for its ease of use and its interesting features: PyCaret.

The theory behind BERT and the Universal Sentence Encoder is deep and complex, and more than a single blog post would be needed to explain it properly. Moreover, I would still not be able to explain them as precisely as their creators, so I’m not even going to try.

On the other hand, the practical usage of these tools is really simple and will be properly explained during this post.

Let’s go.

1. The Dataset:

The dataset is open-source and can be found here. As will become clear, the real and fake news are stored in two different .csv files.

2. The Libraries:

In order to perform this classification, you need the basic Data Scientist starter pack (sklearn, pandas, numpy, …), plus some specific libraries like transformers and pycaret. Here is the full list:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import plot_confusion_matrix
import transformers
from transformers import AutoModel, BertTokenizerFast
import tensorflow_hub as hub
import pycaret
from pycaret.classification import *

# plotting style
plt.style.use('ggplot')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['font.size'] = 14
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 12
plt.rcParams['figure.titlesize'] = 12
plt.rcParams['image.cmap'] = 'jet'
plt.rcParams['image.interpolation'] = 'none'
plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['axes.grid'] = False
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.markersize'] = 8

colors = ['xkcd:pale orange', 'xkcd:sea blue', 'xkcd:pale red', 'xkcd:sage green',
          'xkcd:terra cotta', 'xkcd:dull purple', 'xkcd:teal', 'xkcd:goldenrod',
          'xkcd:cadet blue', 'xkcd:scarlet']
bbox_props = dict(boxstyle="round,pad=0.3", fc=colors[0], alpha=.5)

Note: I had some problems with pycaret, so I decided to run everything on Google Colab. Google Colab is also recommended because the dataset is large and it may strain your computer’s resources.

3. Data Exploration:

The real news and the fake ones are stored in two .csv files:

true_data = pd.read_csv('True.csv')
fake_data = pd.read_csv('Fake.csv')
true_data.head()

A Target column is added to each file, and the data are merged and randomly shuffled into a single DataFrame called data.
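This step is described but not shown in the post; here is a minimal sketch of what it can look like (the 'True'/'Fake' label strings and the shuffling seed are assumptions, chosen to be consistent with the plots below):

# label each source, concatenate, and shuffle into a single DataFrame (sketch of the step described above)
true_data['Target'] = 'True'
fake_data['Target'] = 'Fake'
data = pd.concat([true_data, fake_data]).sample(frac=1, random_state=42).reset_index(drop=True)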

Let’s see if the dataset is well balanced.

# label_size was not defined in the post; the class counts are the natural choice
label_size = data['Target'].value_counts()
plt.pie(label_size, explode=[0.1, 0.1], colors=['firebrick', 'navy'], startangle=90,
        shadow=True, labels=['Fake', 'True'], autopct='%1.1f%%')

[Pie chart of the class balance: Fake 52.3%, True 47.7%.]

The Target column is made of strings, which are not computer-friendly. Let’s adjust that:

data['label']=pd.get_dummies(data.Target)['Fake']

So right now what we want to do is take the title of an article and predict whether or not the news is fake.

Sounds good. Let’s start dancing.

4. Text Classification

4.1 BERT Fine-Tuning

Note: This part is inspired by this great article.

Ok, so now that we have the data we can start with the Machine Learning part. The idea behind the BERT fine-tuning is simple.

We have an extraordinarily good model (and this is what hides the complexity of the approach) that has already been trained to perform well. We take this extraordinarily good model (named BERT) and fine-tune it to perform our specific task. Pretty simple, isn’t it?

Now, follow me.

1. Train-Validation split

train_text, temp_text, train_labels, temp_labels = train_test_split(data['title'], data['label'], random_state=2018, test_size=0.3, stratify=data['Target'])

2. Validation-Test split

val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, random_state=2018, test_size=0.5, stratify=temp_labels)

3. Defining the model and the tokenizer of BERT:

bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

4. Plotting the histogram of the number of words and tokenizing the text:

Since almost all the texts have around 15 words, we truncate every text to 15 tokens for computational reasons, with little loss of information.

seq_len = [len(i.split()) for i in train_text]
pd.Series(seq_len).hist(bins=40, color='firebrick')
plt.xlabel('Number of Words')
plt.ylabel('Number of texts')

MAX_LENGTH = 15

# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length=MAX_LENGTH,
    pad_to_max_length=True,
    truncation=True
)

# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    val_text.tolist(),
    max_length=MAX_LENGTH,
    pad_to_max_length=True,
    truncation=True
)

# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length=MAX_LENGTH,
    pad_to_max_length=True,
    truncation=True
)

5. Converting lists to tensors:

# convert lists to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

6. Data Loader structure definition:

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# define a batch size
batch_size = 32

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)
# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)
# dataLoader for the train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)
# sampler for sampling the data during validation
val_sampler = SequentialSampler(val_data)
# dataLoader for the validation set
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

7. Freezing the parameters and defining the trainable BERT structure:

# freeze all the parameters of the pretrained BERT model
for param in bert.parameters():
    param.requires_grad = False

class BERT_Arch(nn.Module):

    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert
        # dropout layer
        self.dropout = nn.Dropout(0.1)
        # relu activation function
        self.relu = nn.ReLU()
        # dense layer 1
        self.fc1 = nn.Linear(768, 512)
        # dense layer 2 (output layer)
        self.fc2 = nn.Linear(512, 2)
        # softmax activation function
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):
        # pass the inputs to the model
        cls_hs = self.bert(sent_id, attention_mask=mask)['pooler_output']
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc2(x)
        # apply (log-)softmax activation
        x = self.softmax(x)
        return x

# instantiate the model (needed before defining the optimizer below)
model = BERT_Arch(bert)

8. Defining the hyperparameters (optimizer, class weights and number of epochs):

from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

from sklearn.utils.class_weight import compute_class_weight

# compute the class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
print("Class Weights:", class_weights)

Class Weights: [1.04815902 0.95607204]
weights = torch.tensor(class_weights, dtype=torch.float)

# define the loss function
cross_entropy = nn.NLLLoss(weight=weights)

# number of training epochs
epochs = 10

9. Defining training and evaluation functions:

def train():
    model.train()
    total_loss, total_accuracy = 0, 0
    # empty list to save model predictions
    total_preds = []
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # progress update after every 50 batches
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))
        # unpack the batch (everything stays on CPU in this version)
        sent_id, mask, labels = batch
        # clear previously calculated gradients
        model.zero_grad()
        # get model predictions for the current batch
        preds = model(sent_id, mask)
        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)
        # add on to the total loss
        total_loss = total_loss + loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the gradients to 1.0; it helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        # detach the predictions and convert them to numpy
        preds = preds.detach().cpu().numpy()
        # append the model predictions
        total_preds.append(preds)
    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)
    # predictions are in the form (no. of batches, batch size, no. of classes):
    # reshape them into (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)
    # return the loss and the predictions
    return avg_loss, total_preds
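The training loop in the next step also calls an evaluate() function that the post never shows. Here is a minimal sketch that mirrors train() on the validation set (a plausible reconstruction, not the author’s exact code):

# sketch of the missing evaluate() function: same loop as train(), but without gradient updates
def evaluate():
    model.eval()
    total_loss = 0
    total_preds = []
    # iterate over the validation batches
    for step, batch in enumerate(val_dataloader):
        sent_id, mask, labels = batch
        # no gradients are needed during evaluation
        with torch.no_grad():
            preds = model(sent_id, mask)
            loss = cross_entropy(preds, labels)
            total_loss = total_loss + loss.item()
            total_preds.append(preds.detach().cpu().numpy())
    # average validation loss and stacked predictions
    avg_loss = total_loss / len(val_dataloader)
    total_preds = np.concatenate(total_preds, axis=0)
    return avg_loss, total_preds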

10. Train and predict

best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

Ok, I know that this is not super fun, but you should follow all these steps exactly as written above. Each one is essential.
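One step is left implicit in the post: reloading the best checkpoint and computing the predictions on the test set. A minimal sketch, using the tensors and the saved_weights.pt file defined above:

# sketch: reload the best weights and predict on the test set (implied, but not shown, in the post)
model.load_state_dict(torch.load('saved_weights.pt'))
with torch.no_grad():
    preds = model(test_seq, test_mask)
    preds = preds.detach().cpu().numpy()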

As a final step, let’s check the performance here:

preds = np.argmax(preds, axis=1)
print(classification_report(test_y, preds))
              precision    recall  f1-score   support

           0       0.86      0.89      0.88      3213
           1       0.90      0.87      0.88      3522

    accuracy                           0.88      6735
   macro avg       0.88      0.88      0.88      6735
weighted avg       0.88      0.88      0.88      6735

Pretty interesting! 88% accuracy, and high precision and recall as well.

Here is the confusion matrix:

from sklearn.metrics import confusion_matrix
confusion_matrix(test_y, preds)

4.2 Universal Sentence Encoder + PyCaret

Here, the situation is much simpler. The Universal Sentence Encoder is an encoder (a specific way of transforming a sentence into a vector) that has been trained on several tasks. It lets us turn each instance of the dataset into a 512-dimensional vector. The embedding model is loaded with this line of code:

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

And the dataset is encoded using this one:

data_matrix = embed(data.title.tolist())

Let’s do the train-test split:

train_data = data.loc[0:int(len(data)*0.8)]
test_data = data.loc[int(len(data)*0.8):len(data)]

At this point we want to apply traditional Machine Learning methods. In particular, PyCaret lets us train and compare almost all of the most famous and efficient classification algorithms. Nonetheless, we don’t want to work with 512 features for 40,000+ samples, so it is wise to reduce the dimensionality with a Principal Component Analysis (PCA).

Let’s do the PCA with 3 components (it will be clear why):

pca = PCA(n_components=3)
pca_data = pca.fit(data_matrix[0:len(train_data)])
pca_train = pca.transform(data_matrix[0:len(train_data)])

Now it gets interesting.

If you plot the PCA-reduced data, it looks like this:

pca_3_data = pd.DataFrame({'First Component': pca_train[:,0],
                           'Second Component': pca_train[:,1],
                           'Third Component': pca_train[:,2],
                           'Target': train_data.Target})

plt.figure(figsize=(20,10))
plt.subplot(1,3,1)
sns.scatterplot(x='First Component', y='Second Component', hue='Target', data=pca_3_data, s=2)
plt.grid(True)
plt.subplot(1,3,2)
sns.scatterplot(x='First Component', y='Third Component', hue='Target', data=pca_3_data, s=2)
plt.grid(True)
plt.subplot(1,3,3)
sns.scatterplot(x='Second Component', y='Third Component', hue='Target', data=pca_3_data, s=2)
plt.grid(True)

That means the data are almost linearly separable!

Now let’s use PyCaret and its Machine Learning models:
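The PyCaret call itself is not shown in the post; a minimal sketch of that step, assuming the pca_3_data DataFrame built above and PyCaret’s classification module imported earlier:

# sketch of the PyCaret step: set up the experiment and compare the available classifiers
clf = setup(data=pca_3_data, target='Target', session_id=123)
best_model = compare_models()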

Now, the best model is saved as best_model, and it turns out to be the Random Forest Classifier. Let’s use it to predict the test set.
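The test-set preparation is also left implicit. A sketch of it, assuming the fitted pca and the test_data split from above (the names pca_test and y_true match the plotting call below; the column names are the ones used for training):

# sketch: project the test embeddings with the already-fitted PCA and collect the true labels
X_all = np.asarray(data_matrix)
pca_test = pd.DataFrame(pca.transform(X_all[test_data.index]),
                        columns=['First Component', 'Second Component', 'Third Component'])
y_true = test_data['Target'].values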

Similar results (88% accuracy) and a similar confusion matrix too:

plot_confusion_matrix(best_model,pca_test,y_true,cmap='plasma')


5. Conclusions

We are in the fantastic era of Deep Learning. One of the great things about it is that, while it is extremely difficult to train a state-of-the-art neural network from scratch, it is way easier and faster to take a pretrained network, fine-tune it, and obtain state-of-the-art results on your own dataset.
