
Detecting Fake News With Python And Machine Learning

The complete guide on how to combine Python, Machine Learning & NLP to successfully detect fake news



Tealfeed Guest Blog

3 years ago | 5 min read

Do you want to learn how to use fake news to achieve your plans of world control and mass indoctrination using machine learning, NLP and Python? Well… that sounds interesting, but it is a topic for another time.

It would be an understatement to say that, when wielded effectively, fake news is a weapon like no other. Although the existence of fake news is nothing new, it has recently gained much attention due to the enormous amount of misinformation surrounding the novel coronavirus.

The aim of this article is to walk you through the process of creating a machine learning model, using Python and NLP, that successfully detects fake news.

Key Terms

Before we proceed, it is crucial to become acquainted with certain key terms that will be used throughout this article.

TfidfVectorizer

The TfidfVectorizer is used to convert a collection of raw documents into a matrix of TF-IDF features.

IDF (Inverse Document Frequency)

The IDF is a measure of how significant a word is within an entire corpus. It is based on the number of documents in which the word appears: the more documents contain a word, the lower its IDF, since ubiquitous words carry little distinguishing information.

TF (Term Frequency)

The TF, unlike the IDF, is simply the number of times a word appears in a single document.
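
To make these definitions concrete, here is a small, self-contained illustration. The three-document corpus is invented purely for demonstration and is not part of the dataset used later in this article:

from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus: "news" appears in every document, so its IDF (and thus its
# TF-IDF weight) is low, while rarer words such as "fake" score higher.
docs = [
    "fake news spreads fast",
    "real news takes time",
    "news is everywhere",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# get_feature_names_out requires scikit-learn >= 1.0
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))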

Passive Aggressive Classifier

Passive-Aggressive algorithms are online-learning algorithms (suited, for example, to a continuous stream of Twitter data). Their defining characteristic is that they remain passive when an example is classified correctly and become aggressive when a misclassification occurs, thus constantly self-updating and adjusting.
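
To illustrate this online behaviour, here is a minimal, hypothetical sketch in which a PassiveAggressiveClassifier is updated one mini-batch at a time via partial_fit; the two-dimensional feature vectors are made up for demonstration only:

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

clf = PassiveAggressiveClassifier()

# First batch: in online learning, every class must be declared up front.
X_first = np.array([[0.0, 1.0], [1.0, 0.0]])
clf.partial_fit(X_first, ['FAKE', 'REAL'], classes=['FAKE', 'REAL'])

# Later batches arrive as a stream; the model stays passive on correct
# predictions and aggressively updates its weights on mistakes.
X_next = np.array([[0.2, 0.9]])
clf.partial_fit(X_next, ['FAKE'])

print(clf.predict(np.array([[0.1, 0.8]])))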

Preparing our dataset and work environment

First, we need to install a supported version of Python. To do so, navigate to the official Python downloads page and follow the instructions for your operating system.

I will be using Python 3.6.9 and Ubuntu 18.04.4 LTS as my operating system of choice. Nevertheless, any supported Python version will do.

Before proceeding with installing the required libraries, we must make sure pip is installed. (pip ships with every Python release since 2.7.9 and 3.4, but if you do not have it, follow the official pip installation guide.)

Libraries

The following libraries should be installed with pip (note that the sklearn imports used below are provided by the scikit-learn package):

pip3 install pandas
pip3 install scikit-learn
pip3 install numpy

Dataset

Acquiring a suitable dataset is one of the most crucial components of any data science endeavor. In this scenario, the dataset used will contain a number of articles that have been pre-classified as “REAL” or “FAKE” (referring to their contents). They will be used as training data so that the model can learn to determine whether an article presents factual information or simply fake news.

The following Kaggle dataset will be used.

To obtain it, simply navigate to the given web address and download the “train.csv” file.

Once it has been successfully downloaded, we will create a new directory for the project and place the file inside it.
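
For example, the layout could look something like this (the file and folder names are illustrative only):

fake-news-detection/
├── train.csv
└── fake_news_detection.ipynb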

Coding

Now that we have both our libraries and dataset set up, it is time to launch our text editor (I will be using a Jupyter notebook).

We will begin by importing all necessary libraries:


import itertools
import pandas as pd
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

It is now time to import our dataset and get the shape of the data.

# Import dataset
df=pd.read_csv('path_to_your_file')

# Get the shape
df.shape
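
It becomes apparent that the CSV file contains a dataset of 20800 rows and five distinct features (columns). In order to get some further insight concerning our data, we will continue by getting the head (by default, df.head() returns the first five rows):

# Get the head
df.head()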

We can now view the first five records. By doing so, we see that the dataset is divided into the following columns: id, title, author, text and label.

The features that are of interest to us are the label and text columns. The text column contains the contents of the article, whereas the label column represents whether the article is factual or not.

This has been pre-made for us in binary form, using:

‘1’ for FAKE news
‘0’ for a RELIABLE article

Although this would normally be a perfectly good way to represent such values, for simplicity’s sake we are going to convert the ‘1’s and ‘0’s into the more human-friendly ‘FAKE’ and ‘REAL’ string labels.
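
One minimal way to perform this conversion, assuming the dataframe is named df as above:

# Map the binary labels to human-friendly strings (1 -> FAKE, 0 -> REAL).
# If the labels were read as strings, use {'1': 'FAKE', '0': 'REAL'} instead.
df['label'] = df['label'].map({1: 'FAKE', 0: 'REAL'})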

Our label column should now contain ‘FAKE’ and ‘REAL’ values instead of ones and zeros.

We are now going to isolate the labels from the rest of the dataframe.

# Isolate the labels
labels = df.label
labels.head()


Once the previous operation has concluded, the dataset must be split into two distinct sets: 80% of the data will be used to train our model, and the remaining 20% will serve as testing data (this ratio is, of course, subject to change).

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(
    df['text'].values.astype('str'), labels, test_size=0.2, random_state=7)
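
As an optional sanity check, we can confirm the split sizes; an 80/20 split of 20800 rows yields 16640 training and 4160 testing samples:

# Verify the 80/20 split
print(len(x_train), len(x_test))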

We shall now declare a TfidfVectorizer that uses stop words from the English language (this depends on the language of the articles) and discards terms appearing in more than 70% of the documents (max_df=0.7). For more information, you can consult the TfidfVectorizer documentation.

#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

Now that we have a vectorizer, we are going to fit and transform it on the training set, and only transform the testing set; the test set must never be fitted, so that it cannot influence the learned vocabulary and IDF weights.

# Fit & transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
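
At this point tfidf_train and tfidf_test are sparse matrices: each row is a document and each column is a term from the vocabulary learned on the training set. If you are curious, their dimensions can be inspected:

# Both matrices share the same vocabulary (column) dimension
print(tfidf_train.shape, tfidf_test.shape)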

The PassiveAggressiveClassifier is now to be initialized. To train it, we are going to fit it on “tfidf_train” and “y_train”.

# Initialize the PassiveAggressiveClassifier and fit training sets
pa_classifier=PassiveAggressiveClassifier(max_iter=50)
pa_classifier.fit(tfidf_train,y_train)

Finally, we will use the trained classifier to predict whether each article in the test set is reliable or not, and we will calculate our model’s accuracy.

# Predict and calculate accuracy
y_pred=pa_classifier.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

We can now see the accuracy the model achieved during testing. However, the accuracy alone does not tell us the number of successful and failed predictions for each class. To access that information, we will need a confusion matrix. This can easily be built as follows:

# Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

From the confusion matrix we can draw the following conclusions:

Our model successfully predicted 2033 true positives.
Our model successfully predicted 1988 true negatives.
Our model predicted 67 false positives.
Our model predicted 72 false negatives.
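
For readability, the four cells can be unpacked by hand. This is a small sketch that assumes the y_test and y_pred variables defined above; with labels=['FAKE', 'REAL'], row 0 of the matrix corresponds to articles whose true label is FAKE:

cm = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
true_fake, missed_fake = cm[0]       # true FAKE predicted as FAKE / as REAL
misflagged_real, true_real = cm[1]   # true REAL predicted as FAKE / as REAL
print(f'True positives (FAKE caught): {true_fake}')
print(f'False negatives (FAKE missed): {missed_fake}')
print(f'False positives (REAL misflagged): {misflagged_real}')
print(f'True negatives (REAL passed): {true_real}')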

Last thoughts

By using different iterations of the code and datasets, I have managed to reach a maximum accuracy of 98.3%, with an average of 96.83%. This is quite remarkable, as it is considerably more accurate than similar models I have encountered, whose accuracies ranged from 86% to 92.82%.

It is essential that such models are further advanced in order to effectively combat misinformation in these difficult times.

This article was originally published by Filippos Dounis on Medium.
