Text Classification with Extremely Small Datasets

A guide to making the most of your tiny datasets


Anirudh S

2 years ago | 26 min read

As the saying goes, in this era of deep learning “data is the new oil”. However, unless you work for a Google, a Facebook or some other tech giant, getting access to adequate data can be a tough task.

This is especially true for small companies operating in niche domains or personal projects that you or I might have.

In this blog, we’ll simulate a scenario where we only have access to a very small dataset and explore this concept at length. In particular, we’ll build a text classifier that can detect clickbait titles and experiment with different techniques and models to deal with small datasets.

Blog Outline:

  1. What is Clickbait?
  2. Getting the dataset
  3. Why are small datasets a pain in ML?
  4. Splitting the Data
  5. Simple Exploratory Data Analysis
  6. Bag-of-Words, TF-IDF, and Word Embeddings
  7. Feature Engineering
  8. Exploring Models and Hyperparameter Tuning
  9. Dimensionality Reduction
  10. Summary

1. What is clickbait?

Often, you might have come across titles like these:

“We tried building a classifier with a small dataset. You Wont Believe What Happens Next!”
“We love these 11 techniques to build a text classifier. # 7 will SHOCK you.”
“Smart Data Scientists use these techniques to work with small datasets. Click to know what they are”

These types of catchy titles are all over the internet. But what makes a title “Clickbait-y”? Wikipedia defines it as:

Clickbait is a form of false advertisement which uses hyperlink text or a thumbnail link that is designed to attract attention and entice users to follow that link and read, view, or listen to the linked piece of online content, with a defining characteristic of being deceptive, typically sensationalized or misleading.

In general, the question of whether a post is clickbait or not seems to be rather subjective. (Check out: “Why BuzzFeed Doesn’t Do Clickbait” [1]). This means that while finding a dataset, it would be best to look for one that is manually reviewed by multiple people.

2. Getting the dataset

After some searching, I found “Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media” by Chakraborty et al. (2016) [2] and their accompanying GitHub repo.

The dataset contains 15,000+ article titles that have been labeled as clickbait or non-clickbait. The non-clickbait titles come from Wikinews and have been curated by the Wikinews community, while the clickbait titles come from ‘BuzzFeed’, ‘Upworthy’, etc.

To ensure there aren’t any false positives, the titles labeled as clickbait were verified by six volunteers and each title was further labeled by at least three volunteers. Section 2 of the paper contains more details.

Baseline performance: The authors used 10-fold CV on a randomly sampled 15k dataset (balanced). The best results they achieved were with RBF-SVM achieving an accuracy of 93%, Precision 0.95, Recall 0.9, F1 of 0.93, ROC-AUC of 0.97

So here’s our challenge:

We’ll work with 50 data points for our train set and 10000 data points for our test set. This means the train set is just 0.5% of the size of the test set. We will not use any part of our test set in training; it will serve purely as a held-out validation set.

Evaluation Metrics:

Keeping track of performance metrics will be critical in understanding how well our classifier is doing as we progress through different experiments. F1-Score will be our main performance metric but we’ll also keep track of Precision, Recall, ROC-AUC and Accuracy.

3. Why are small datasets a pain in ML?

Before we dive in, it’s important to understand why small datasets are difficult to work with:

  1. Overfitting: When the dataset is small the classifier has more degrees of freedom to construct the decision boundary. To demonstrate this, I trained a Random Forest Classifier 6 times on the same dataset (a modified version of the Iris Dataset with only 8 points)
Varying Decision Boundaries for a small dataset

Notice how the decision boundary changes wildly. This is because the classifier struggles to generalize with the small amount of data. Mathematically, this means our prediction will have high variance.

Potential Solutions:

I. Regularization: We’ll have to use large amounts of L1, L2 and other forms of regularization.

II. Simpler models: Low complexity linear models like Logistic Regression and SVMs will tend to perform better as they have fewer degrees of freedom.

2. Outliers:

Outliers have dramatic effects on small datasets as they can skew the decision boundary significantly. In the plots below I added some noise and changed the label of one of the data points making it an outlier — notice the effect this has on the decision boundary.

Effect of Outliers on the Decision Boundary

Potential Solution:

Outlier detection and removal: We can use clustering algorithms like DBSCAN or ensemble methods like Isolation Forests to find outliers and drop them before training.
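
As a rough illustration of the Isolation Forest option, here is a minimal sketch on made-up 2D data (the data, the contamination value and the variable names are assumptions for the example, not part of the original experiments):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# 30 made-up inliers clustered near the origin, plus one obvious outlier.
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)), [[8.0, 8.0]]])

# contamination is the assumed fraction of outliers; it needs tuning on real data.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
flags = iso.fit_predict(X)  # +1 for inliers, -1 for outliers

# Keep only the rows flagged as inliers before training a classifier.
X_clean = X[flags == 1]
print('Removed {} of {} points'.format(X.shape[0] - X_clean.shape[0], X.shape[0]))
```

With only 50 training points, every removed sample is costly, so it is probably worth inspecting the flagged rows by hand before dropping them.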

3. High Dimensionality:

As more features are added, the classifier has a higher chance to find a hyperplane to split the data. However, if we increase the dimensionality without increasing the number of training samples, the feature space becomes more sparse and the classifier overfits easily. This is a direct result of the curse of dimensionality — best explained in this blog

Potential Solutions:

I. Decomposition Techniques: PCA/SVD to reduce the dimensionality of the feature space

II. Feature Selection: To remove features that aren’t useful in prediction.

We’ll dive into these solutions in this blog.

4. Splitting the Data

Let’s begin by splitting our data into train and test sets. As mentioned earlier, we’ll use 50 data points for train and 10000 data points for test.

(To keep things clean here I’ve removed some trivial code: You can check the GitHub repo for the complete code)

data = pd.DataFrame(clickbait_data)

# Now let's split the data
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, shuffle = True, stratify = data.label, train_size = 50/data.shape[0], random_state = 50)
test, _ = train_test_split(test, shuffle = True, stratify = test.label, train_size = 10000/test.shape[0], random_state = 50)
train.shape, test.shape

Output:
((50, 2), (10000, 2))

An important step here is to ensure that our train and test sets come from the same distribution so that any improvements on the train set are reflected in the test set.

A common technique used by Kagglers is to use “Adversarial Validation” between the different datasets. (I’ve seen it go by many names, but I think this one is the most common)

The idea is very simple: we mix both datasets and train a classifier to try to distinguish between them. If the classifier fails to do so, we can conclude that the distributions are similar. You can read more here:

ROC AUC is the preferred metric: a value of ~0.5 or lower means the classifier does no better than a random model, which suggests that the two sets come from similar distributions.

Code for Adversarial Validation
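
The original helper is not reproduced in this text. As a minimal sketch of what an adversarial_validation function along these lines might look like (the cross-validation setup, model settings and demo data are assumptions, not the author's exact code):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def adversarial_validation(x_a, x_b, n_splits=5, seed=0):
    """Cross-validated AUC of classifiers trained to tell x_a apart from x_b."""
    X = vstack([x_a, x_b]).tocsr()
    # Label 0 = row came from x_a, label 1 = row came from x_b.
    y = np.r_[np.zeros(x_a.shape[0], dtype=int), np.ones(x_b.shape[0], dtype=int)]
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    aucs = {}
    for name, model in models.items():
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        probs = cross_val_predict(model, X, y, cv=cv, method='predict_proba')[:, 1]
        aucs[name] = roc_auc_score(y, probs)
        print('{} AUC : {:.3f}'.format(name, aucs[name]))
    return aucs

# Made-up count data: two samples from the same distribution, one shifted sample.
rng = np.random.RandomState(0)
same_a = csr_matrix(rng.poisson(1.0, size=(50, 20)))
same_b = csr_matrix(rng.poisson(1.0, size=(50, 20)))
shifted = csr_matrix(rng.poisson(5.0, size=(50, 20)))

aucs_same = adversarial_validation(same_a, same_b)
aucs_diff = adversarial_validation(same_a, shifted)
```

On the shifted sample the AUC should be close to 1, while for the two samples drawn from the same distribution it should hover around 0.5.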

Let’s use Bag-Of-Words to encode the titles before doing adversarial validation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import shuffle

bow = CountVectorizer()
x_train = bow.fit_transform(train.title.values)
x_test = bow.transform(test.title.values)

x_test = shuffle(x_test)
adversarial_validation(x_train, x_test[:50])

Output:
Logisitic Regression AUC : 0.384
Random Forest AUC : 0.388

The low AUC value suggests that the distributions are similar.

Just to see what would happen if the distributions were different, I ran a web crawler on Breitbart, a news source that is not used in the dataset, and collected some article titles.

bow = CountVectorizer()
x_train = bow.fit_transform(breitbart.title.values)
x_test = bow.transform(test.title.values)

x_train = shuffle(x_train)
x_test = shuffle(x_test)
adversarial_validation(x_train[:50], x_test[:50])

Output:
Logisitic Regression AUC : 0.720
Random Forest AUC : 0.794

The AUC values are much higher, indicating that the distributions are different.

Now let’s move ahead and do some basic EDA on the train dataset.

5. Simple Exploratory Data Analysis

Let’s start by checking if the datasets are balanced:

print('Train Positive Class % : {:.1f}'.format((sum(train.label == 'clickbait')/train.shape[0])*100))
print('Test Positive Class % : {:.1f}'.format((sum(test.label == 'clickbait')/test.shape[0])*100))

print('Train Size: {}'.format(train.shape[0]))
print('Test Size: {}'.format(test.shape[0]))

Output:
Train Positive Class % : 50.0
Test Positive Class % : 50.0
Train Size: 50
Test Size: 10000

Next, let’s check the effect of number of words.

Looks like Clickbait titles have more words in them. What about mean word length?

Clickbait titles use shorter words as compared to non-clickbait titles. Since clickbait titles generally have simpler words, we can check what % of the words in the titles are stop-words

Strange, the clickbait titles seem to have no stopwords that are in the NLTK stopwords list. This is probably a coincidence because of the train-test split or we need to expand our stop word list.

Something to explore during feature engineering for sure. Also, stop word removal as a preprocessing step is not a good idea here.

A word cloud can help us identify words that are more prominent in each class. Let’s take a look:

Wordcloud for Clickbait Titles
Wordcloud for Non-Clickbait Titles

The distribution of words is quite different between clickbait and non-clickbait titles.

For example, non-clickbait titles have states/countries like “Nigeria”, “China”, “California”, etc. and words more associated with the news like “Riots”, “Government” and “bankruptcy”. Clickbait titles seem to have more generic words like “Favorite”, “relationships”, “thing”, etc.

Using Bag-Of-Words, TF-IDF or word embeddings like GloVe/W2V as features should help here. At the same time, we might also be able to get a lot of performance improvements with simple text features like lengths, word-ratios, etc.

Let’s try TSNE on Bag-of-Words encoding for the titles:

TSNE on BoW Title Encodings

Both the classes seem to be clustered together with BoW encoding. In the next section, we’ll explore different embedding techniques.

Utility Functions:

Before we start exploring embeddings, let’s write a couple of helper functions to run Logistic Regression and calculate evaluation metrics.

Since we want to optimize our model for F1-Scores, for all models we’ll first predict the probability of the positive class. We’ll then use these probabilities to get the Precision-Recall curve and from here we can select a threshold value that has the highest F1-score. To predict the labels we can simply use this threshold value.

Utility Functions to calculate F1 and run Log Reg
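
The original helpers are not reproduced in this text. The threshold-selection step described above can be sketched roughly as follows (the function names mirror those used later in the post, but the bodies and the toy inputs are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_curve, roc_auc_score

def best_f1_threshold(y_true, y_prob):
    """Return the threshold on the PR curve with the highest F1, plus its metrics."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final (precision=1, recall=0) point has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best], precision[best], recall[best]

def print_model_metrics(y_true, y_prob):
    """Print F1, precision and recall at the best threshold, plus AUC and accuracy."""
    thr, f1, pr, re = best_f1_threshold(y_true, y_prob)
    y_pred = (np.asarray(y_prob) >= thr).astype(int)
    auc = roc_auc_score(y_true, y_prob)
    acc = accuracy_score(y_true, y_pred)
    print('F1: {:.3f} | Pr: {:.3f} | Re: {:.3f} | AUC: {:.3f} | Accuracy: {:.3f}'.format(
        f1, pr, re, auc, acc))
    return f1

# Toy labels and predicted probabilities, purely for illustration.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
thr, f1, pr, re = best_f1_threshold(y_true, y_prob)
print_model_metrics(y_true, y_prob)
```

Note that picking the threshold on the same set that is used for reporting makes the reported F1 slightly optimistic; with a large held-out set like ours the effect should be small.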

6. Bag-of-Words, TF-IDF and Word Embeddings

In this section, we’ll encode the titles with BoW, TF-IDF and Word Embeddings and use these as features without adding any other hand-made features.

Starting with BoW and TF-IDF:

y_train = np.where(train.label.values == 'clickbait', 1, 0)
y_test = np.where(test.label.values == 'clickbait', 1, 0)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-Words
bow = CountVectorizer()
x_train = bow.fit_transform(train.title.values)
x_test = bow.transform(test.title.values)

run_log_reg(x_train, x_test, y_train, y_test)

# TF-IDF
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(train.title.values)
x_test = tfidf.transform(test.title.values)

run_log_reg(x_train, x_test, y_train, y_test)

Output:
For BoW:
F1: 0.782 | Pr: 0.867 | Re: 0.714 | AUC: 0.837 | Accuracy: 0.801

For TF-IDF:
F1: 0.829 | Pr: 0.872 | Re: 0.790 | AUC: 0.896 | Accuracy: 0.837

TFIDF performs slightly better than BoW. An interesting fact is that we’re getting an F1 score of 0.829 with just 50 data points. This is why Log Reg + TFIDF is a great baseline for NLP classification tasks.

Next, let’s try 100-D GloVe vectors. We’ll use the PyMagnitude library. (PyMagnitude is a fantastic library that includes great features like smart out-of-vocab representations. Highly recommended!)

Since titles can have varying lengths, we’ll find the GloVe representation for each word and average all of them together giving a single 100-D vector representation for each title.

# We'll use Average Glove here
from tqdm import tqdm_notebook
from nltk import word_tokenize
from pymagnitude import *

glove = Magnitude("./vectors/glove.6B.100d.magnitude")

def avg_glove(df):
    vectors = []
    for title in tqdm_notebook(df.title.values):
        vectors.append(np.average(glove.query(word_tokenize(title)), axis = 0))
    return np.array(vectors)

x_train = avg_glove(train)
x_test = avg_glove(test)

run_log_reg(x_train, x_test, y_train, y_test)

Output:
F1: 0.929 | Pr: 0.909 | Re: 0.950 | AUC: 0.979 | Accuracy: 0.928

Woah! That’s a huge increase in F1 score with just a small change in title encoding. The improved performance is justified since GloVe vectors are pre-trained embeddings that contain a lot of contextual information.

This would contribute to the performance of the classifier, especially when we have a very limited dataset.

Instead of just taking the average of each word, what if we did a weighted average — in particular, IDF-Weighted average?

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
# Assumption: the vectorizer is fit on the train titles (idf_ only exists after fitting)
tfidf.fit(train.title.values)

# Now let's create a dict so that for every word in the corpus we have a corresponding IDF value
# (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
idf_dict = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

# Same as Avg Glove except instead of doing a regular average, we'll use the IDF values as weights.
def tfidf_glove(df):
    vectors = []
    for title in tqdm_notebook(df.title.values):
        glove_vectors = glove.query(word_tokenize(title))
        weights = [idf_dict.get(word, 1) for word in word_tokenize(title)]
        vectors.append(np.average(glove_vectors, axis = 0, weights = weights))
    return np.array(vectors)

x_train = tfidf_glove(train)
x_test = tfidf_glove(test)

run_log_reg(x_train, x_test, y_train, y_test)

Output:
F1: 0.957 | Pr: 0.943 | Re: 0.971 | AUC: 0.989 | Accuracy: 0.956

Our F1 increased by ~0.03 points (0.929 to 0.957). The increased performance makes sense: commonly occurring words get less weight, while less frequent (and perhaps more important) words have more say in the vector representation for the titles.

Since GloVe worked so well, let’s try one last embedding technique — Facebook’s InferSent model. This model converts the entire sentence into a vector representation. However, a potential problem is that the vector representations are 4096 dimensional which might cause our model to overfit easily. Let’s give it a shot anyway:

from InferSent.models import InferSent
import torch

MODEL_PATH = './encoder/infersent1.pkl'
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}

infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))
infersent.set_w2v_path('GloVe/glove.840B.300d.txt')
infersent.build_vocab(train.title.values, tokenize= False)

x_train = infersent.encode(train.title.values, tokenize= False)
x_test = infersent.encode(test.title.values, tokenize= False)

run_log_reg(x_train, x_test, y_train, y_test, alpha = 1e-4)

Output:
F1: 0.927 | Pr: 0.912 | Re: 0.946 | AUC: 0.966 | Accuracy: 0.926

As expected the performance drops — most likely due to overfitting from the 4096-dimensional features.

Before we end this section, let’s try TSNE again, this time on IDF-weighted GloVe vectors.

This time we see some separation between the 2 classes in the 2D projection. To some extent, this explains the high accuracy we achieved with simple Log Reg.

To increase performance further, we can add some hand made features. Let’s try this in the next section.

7. Feature Engineering

Creating new features can be tricky. The best way to get a headstart on this is to dive into the domain and look for research papers, blogs, articles, etc. Kaggle Kernels in related domains are also a good way to find information on interesting features.

For clickbait detection, the paper we used for the dataset (Chakraborty et al) mentioned a few features they used. I also found Potthast et al (2016) [3] in which they documented over 200 features.

We can implement some of the easy ones along with the Glove embeddings from the previous section and check for any performance improvements. Here’s a quick summary of the features:

  1. Starts with Number: A boolean feature that checks if a title starts with a number. Eg: “11 incredible ways to do XYZ”
  2. Clickbait Phrases: Downworthy is a hilarious Chrome extension that replaces clickbait titles with “more realistic headlines”. The GitHub repo lists popular clickbait phrases like ‘Everything You Need to Know’, ‘This Is What Happens’ etc. We can use this list to check if the titles in our dataset contain any of these phrases. Chakraborty et al also provide a list of further phrases.
  3. Clickbait RegEx: Downworthy also has a couple of regular expressions that we can match against the titles.
  4. Number of dots
  5. Readability Scores: Calculate the Flesch-Kincaid grade and Dale-Chall readability scores. These scores provide an idea of the ease of reading the titles. The textstat python package provides an easy way to implement these. Generally, we expect news titles to be harder to read.
  6. Simple text features: Length of the longest word, mean word length in characters and length of the title in characters.
  7. Number of Punctuations
  8. Word Ratios: Here we calculate 6 different ratios: (i) Easy Words (as defined by the Dale Chall Easy Words List) (ii) Stop Words (iii) Contractions (eg: aren’t, shouldn’t, etc) (iv) Hyperbolic words (as defined by Chakraborty et al eg: Amazing, Incredible etc) (v) Clickbait Subjects (Chakraborty et al define some nouns/subjects that are more prevalent in clickbait titles like ‘Guys’, ‘Dogs’ etc) (vi) Non-Clickbait Subjects (same as above for non-clickbait titles eg: India, Iran, Government etc)
  9. Number of Hashtags
  10. Sentiment Scores: We can use NLTK’s Vader Sentiment analyzer to get Negative, Neutral, Positive and Compound scores for each title.
  11. Embeddings: TFIDF/Glove/Weighted Glove
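
To make a few of the simpler features in the list concrete, here is a minimal sketch in plain Python. The phrase and contraction lists are tiny stand-ins for the Downworthy and Chakraborty et al lists, and simple_features is a hypothetical helper, not the featurize function used below:

```python
import re
import string

# Tiny illustrative lists; the real ones come from Downworthy and Chakraborty et al.
CLICKBAIT_PHRASES = ['everything you need to know', 'this is what happens',
                     "you won't believe"]
CONTRACTIONS = {"aren't", "shouldn't", "won't", "can't", "don't"}

def simple_features(title):
    """Compute a handful of hand-made features for a single title."""
    words = title.split()
    lower = title.lower()
    n_words = len(words)
    return {
        'starts_with_number': int(bool(re.match(r'\d', title.strip()))),
        'clickbait_phrases': int(any(p in lower for p in CLICKBAIT_PHRASES)),
        'num_dots': title.count('.'),
        'num_hashtags': title.count('#'),
        'num_punctuations': sum(ch in string.punctuation for ch in title),
        'longest_word_length': max((len(w) for w in words), default=0),
        'mean_word_length': sum(len(w) for w in words) / n_words if n_words else 0.0,
        'length_in_chars': len(title),
        'contractions_ratio': (sum(w.lower() in CONTRACTIONS for w in words) / n_words
                               if n_words else 0.0),
    }

feats = simple_features("11 Things You Won't Believe Happen In An Office.")
for name, value in feats.items():
    print('{}: {}'.format(name, value))
```

Whitespace tokenization keeps the sketch dependency-free; a proper tokenizer such as NLTK's word_tokenize would separate trailing punctuation from words and change some of these values slightly.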

After implementing these we can choose to expand the feature space with polynomial (eg X²) or interaction features (eg XY) by using sklearn’s PolynomialFeatures()

Note: The choice of feature scaling technique made quite a big difference to the performance of the classifier. I tried RobustScaler, StandardScaler, Normalizer and MinMaxScaler and found that MinMaxScaler worked the best.
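
As a small illustration of how scaling and polynomial expansion fit together, here is a sketch on made-up numbers (the arrays are assumptions for the example; in the post these steps would operate on the hand-made features):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Made-up feature matrices with two columns on very different scales.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_test = np.array([[2.0, 30.0]])

# Fit the scaler on the train set only, then reuse it for the test set.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Degree-2 expansion yields the columns: x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_p = poly.fit_transform(X_train_s)
X_test_p = poly.transform(X_test_s)

print(X_train_p.shape)
print(X_test_p)
```

Because the expansion grows the feature count quadratically, it works against the dimensionality concerns raised earlier, so it probably only pays off when combined with strong regularization or feature selection.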

from featurization import *

train_features, test_features, feature_names = featurize(train, test, 'tfidf_glove')

run_log_reg(train_features, test_features, y_train, y_test, alpha = 5e-2)

Output:
F1: 0.964 | Pr: 0.956 | Re: 0.972 | AUC: 0.993 | Accuracy: 0.964

Nice! We went from an F1 score of 0.957 to 0.964 on simple logistic regression. We might be able to squeeze out some more performance improvements when we try out different models and do hyperparameter tuning later.

For now, let’s take a short detour into model interpretability to check how our model is making these predictions. We’ll use the SHAP and ELI5 libraries to understand the importance of the features.

Feature Weights

Let’s start with feature importance. This is pretty straightforward with the ELI5 library.

from sklearn.linear_model import SGDClassifier
import eli5

# Train a Log Reg Classifier
# (on scikit-learn >= 1.1 use loss = 'log_loss'; 'log' was removed in 1.3)
log_reg = SGDClassifier(loss = 'log', n_jobs = -1, alpha = 5e-2)
log_reg.fit(train_features, y_train)

# Pass the model instance along with the feature names to ELI5
eli5.show_weights(log_reg, feature_names = feature_names, top = 100)
Feature Weights

Apart from the glove dimensions, we can see a lot of the hand made features have large weights. The greener a feature is the more important it is to classify the sample as ‘clickbait’.

For example, the starts_with_number feature is very important for classifying a title as clickbait. This makes sense because, in the dataset, titles like "12 reasons why you should XYZ” are often clickbait.

Let’s take a look at the dale_chall_readability_score feature which has a weight of -0.280. If the Dale Chall Readability score is high, it means that the title is difficult to read. Here, our model has learned that if a title is more difficult to read, it is probably a News title and not clickbait. Pretty cool!

In addition, there are some features that have a weight very close to 0. Removing these features might help in reducing overfitting. We’ll explore this in the Feature Selection section.

SHAP Force Plot

Now let’s move onto the SHAP Force Plot

import shap

log_reg = SGDClassifier(loss = 'log', n_jobs = -1, alpha = 5e-2)
log_reg.fit(train_features, y_train)

explainer = shap.LinearExplainer(log_reg, train_features, feature_dependence = 'independent')
shap_values = explainer.shap_values(test_features)

shap.initjs()
ind = 0
shap.force_plot(explainer.expected_value, shap_values[ind,:], test_features.toarray()[ind,:],
                feature_names = feature_names)
SHAP Force Plot

A force plot is like a ‘tug-of-war’ game between features. Each feature pushes the output of the model to the left or right of the base value. The base value is the average output of the model over the background data passed to the explainer (train_features in the code above). Keep in mind this is not a probability value.

Features in pink help the model detect the positive class i.e. ‘Clickbait’ titles while features in blue detect the negative class. The width of each feature is directly proportional to its weightage in the prediction.

In the example above, the starts_with_number feature is 1 and has a lot of importance and hence pushes the model's output to the right. On the other hand, clickbait_subs_ratio and easy_words_ratio (high values in these features usually indicate clickbait, but in this case, the values are low) are both pushing the model to the left.

We can verify that in this particular example, the model ends up predicting ‘Clickbait’:

print('Title: {}'.format(test.title.values[0]))
print('Label: {}'.format(test.label.values[0]))
print('Prediction: {}'.format(log_reg.predict(test_features.tocsr()[0,:])[0]))

Output:
Title: 15 Highly Important Questions About Adulthood, Answered By Michael Ian Black
Label: clickbait
Prediction: 1

As expected, the model correctly labels the title as clickbait.

Let’s take a look at another example:

SHAP Force Plot for Non-Clickbait Titles

print('Title: {}'.format(test.title.values[400]))
print('Label: {}'.format(test.label.values[400]))
print('Prediction: {}'.format(log_reg.predict(test_features.tocsr()[400,:])[0]))

Output:
Title: Europe to Buy 30,000 Tons of Surplus Butter
Label: not-clickbait
Prediction: 0

In this case, the model gets pushed to the left since features like sentiment_pos (clickbait titles usually have a positive sentiment) have a low value.

Force plots are a wonderful way to take a look at how models do prediction on a sample-by-sample basis. In the next section, we’ll try different models including ensembles along with hyperparameter tuning.

8. Exploring Models and Hyperparameter Tuning

In this section, we’ll use the features we created in the previous section, along with IDF-weighted embeddings and try them on different models.

As mentioned earlier, when dealing with small datasets, low-complexity models like Logistic Regression, SVMs, and Naive Bayes will generalize the best. We’ll try these models along with non-parametric models like KNN and non-linear models like Random Forest, XGBoost, etc.

We’ll also try bootstrap-aggregating or bagging with the best-performing classifier as well as model stacking. Let’s get started!

For hyperparameter tuning, GridSearchCV is a good choice for our case since we have a small dataset (allowing it to run quickly) and it's an exhaustive search. We’ll need a few hacks to make it (a) use our predefined test set instead of cross-validation and (b) use our F1 evaluation metric, which uses PR curves to select the threshold.

GridSearchCV with PredefinedSplit
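
The original helper is not reproduced in this text. A minimal sketch of the two hacks, assuming dense feature arrays and using a hypothetical grid_search_predefined function (its signature differs from the run_grid_search call below, and LogisticRegression is swapped in for the demo), might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, PredefinedSplit

def f1_scorer(estimator, X, y):
    """Best F1 over all thresholds on the precision-recall curve."""
    p, r, _ = precision_recall_curve(y, estimator.predict_proba(X)[:, 1])
    return float(np.max(2 * p * r / np.maximum(p + r, 1e-12)))

def grid_search_predefined(model, params, x_train, y_train, x_val, y_val):
    """Grid search that trains on x_train and scores on x_val only."""
    X = np.vstack([x_train, x_val])  # for sparse inputs, scipy.sparse.vstack would be needed
    y = np.concatenate([y_train, y_val])
    # -1 marks rows that are always used for training; 0 marks the single validation fold.
    test_fold = np.r_[np.full(len(y_train), -1), np.zeros(len(y_val), dtype=int)]
    ps = PredefinedSplit(test_fold)
    # refit=False avoids refitting the best model on train + validation combined.
    grid = GridSearchCV(model, params, scoring=f1_scorer, cv=ps, refit=False)
    grid.fit(X, y)
    return grid.best_params_, grid.best_score_

# Made-up data: 50 training rows and 200 validation rows.
X_demo, y_demo = make_classification(n_samples=250, n_features=10, random_state=0)
best_params, best_score = grid_search_predefined(
    LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1.0]},
    X_demo[:50], y_demo[:50], X_demo[50:], y_demo[50:])
print(best_params, round(best_score, 3))
```

One caveat with this setup is that hyperparameters are chosen on the same set that is later used for reporting, so the final scores are likely somewhat optimistic.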

Logistic Regression

from sklearn.linear_model import SGDClassifier

# (on scikit-learn >= 1.1 use loss = 'log_loss' instead of 'log')
lr = SGDClassifier(loss = 'log')
lr_params = {'alpha' : [10**(-x) for x in range(7)],
             'penalty' : ['l1', 'l2', 'elasticnet'],
             'l1_ratio' : [0.15, 0.25, 0.5, 0.75]}

best_params, best_f1 = run_grid_search(lr, lr_params, X, y)

print('Best Parameters : {}'.format(best_params))

lr = SGDClassifier(loss = 'log',
                   alpha = best_params['alpha'],
                   penalty = best_params['penalty'],
                   l1_ratio = best_params['l1_ratio'])
lr.fit(train_features, y_train)
y_test_prob = lr.predict_proba(test_features)[:,1]
print_model_metrics(y_test, y_test_prob)

Output:
Best Parameters : {'alpha': 0.1, 'l1_ratio': 0.15, 'penalty': 'elasticnet'}
F1: 0.967 | Pr: 0.955 | Re: 0.979 | AUC: 0.994 | Accuracy: 0.967

Notice that the tuned parameters use both a high value of alpha (indicating a large amount of regularization) and the elasticnet penalty. These choices make sense because a model trained on such a small dataset overfits easily.

We can do the same tuning procedure for SVM, Naive Bayes, KNN, RandomForest, and XGBoost. The table below summarizes the results for these. (You can refer to the GitHub repo for the complete code.)

Summary of Hyperparameter Tuning

Simple MLP

In the fast.ai course, Jeremy Howard mentions that deep learning has been applied to tabular data quite successfully in many cases. Let’s see how well it performs for our use case:

2-Layer MLP in Keras

y_pred_prob = simple_nn.predict(test_features.todense())
print_model_metrics(y_test, y_pred_prob)

F1: 0.961 | Pr: 0.952 | Re: 0.970 | AUC: 0.992 | Accuracy: 0.960

The 2-layer MLP model works surprisingly well, given the small dataset.

Bagging Classifier

Since SVM worked so well, we can try a bagging classifier by using SVM as a base estimator. This should reduce the variance of the base model and reduce overfitting.

from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

svm = SVC(C = 10, kernel = 'poly', degree = 2, probability = True, verbose = 0)

svm_bag = BaggingClassifier(svm, n_estimators = 200, max_features = 0.9, max_samples = 1.0,
                            bootstrap_features = False, bootstrap = True, n_jobs = 1, verbose = 0)
svm_bag.fit(train_features, y_train)
y_test_prob = svm_bag.predict_proba(test_features)[:,1]
print_model_metrics(y_test, y_test_prob)

Output:
F1: 0.969 | Pr: 0.959 | Re: 0.980 | AUC: 0.995 | Accuracy: 0.969

The performance increase is almost insignificant.

Finally, one last thing we can try is the Stacking Classifier (a.k.a. Voting Classifier).

Stacking Classifier

This is a weighted average of the predictions of different models. Since we are also using the Keras model, we won’t be able to use Sklearn’s VotingClassifier. Instead, we'll just run a simple loop that gets the predictions of each model and computes a weighted average. We’ll use the tuned hyperparameters for each model.

Simple Stacking Classifier

Training LR
Training SVM
Training NB
Training KNN
Training RF
Training XGB
F1: 0.969 | Pr: 0.968 | Re: 0.971 | AUC: 0.995 | Accuracy: 0.969
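
The original loop is not reproduced in this text. The weighted-average step itself can be sketched as follows, using made-up probabilities and weights (the helper name and the numbers are assumptions for illustration):

```python
import numpy as np

def weighted_average_probs(probs_by_model, weights):
    """Blend per-model positive-class probabilities with a weighted average."""
    names = sorted(probs_by_model)
    stacked = np.vstack([probs_by_model[n] for n in names])  # (n_models, n_samples)
    w = np.array([weights[n] for n in names], dtype=float)
    return np.average(stacked, axis=0, weights=w)

# Made-up predicted probabilities from three models for three samples.
probs = {
    'LR': np.array([0.9, 0.2, 0.6]),
    'SVM': np.array([0.8, 0.1, 0.4]),
    'RF': np.array([0.5, 0.5, 0.5]),
}
weights = {'LR': 1.0, 'SVM': 1.0, 'RF': 0.5}

blended = weighted_average_probs(probs, weights)
print(blended)
```

The blended probabilities can then be passed to the same metric helper as any single model's probabilities, which is what makes it easy to mix scikit-learn and Keras models.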

Now we need a way to select the best weights for each model. The best option is to use an optimization library like Hyperopt that can search for the best combination of weights that maximizes F1-score.

Running Hyperopt for the stacking classifier

Hyperopt finds a set of weights that gives an F1 ~ 0.971. Let’s inspect the optimized weights:

{'KNN': 0.7866810233035141,
'LR': 0.8036572275670447,
'NB': 0.9102009774357307,
'RF': 0.1559824350958057,
'SVM': 0.9355079606348642,
'XGB': 0.33469066125332436,
'simple_nn': 0.000545264707939086}

The low complexity models like Logistic Regression, Naive Bayes and SVM have high weights, while non-linear models like Random Forest, XGBoost and the 2-layer MLP have much lower weights. This is in line with what we had expected, i.e. low-complexity, simple models generalize best with smaller datasets.

Finally, running the stacking classifier with the optimized weights gives:

F1: 0.971 | Pr: 0.962 | Re: 0.980 | AUC: 0.995 | Accuracy: 0.971

In the next section, we’ll address another concern with small datasets — high dimensional feature spaces.

9. Dimensionality Reduction

As we discussed in the intro, the feature space becomes sparse as we increase the dimensionality of small datasets causing the classifier to easily overfit.

The solution is simply to reduce the dimensionality. Two broad ways to do this are Feature selection and Decomposition.

Feature Selection

These are techniques in which features are selected based on how relevant they are in prediction.


We’ll start with SelectKBest which, as the name suggests, simply selects the k-best features based on the chosen statistic (by default ANOVA F-Scores)

from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k = 80)
train_features_selected = selector.fit_transform(train_features, y_train)
test_features_selected = selector.transform(test_features)
run_log_reg(train_features_selected, test_features_selected, y_train, y_test)

Output:
F1: 0.958 | Pr: 0.946 | Re: 0.971 | AUC: 0.989 | Accuracy: 0.957

A small problem with SelectKBest is that we need to manually specify the number of features we want to keep. An easy way around this is to run a loop that checks the F1 score for each value of K. Here’s a plot of the number of features vs F1 Score:
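
Such a loop can be sketched roughly as below. This is a simplified stand-in on synthetic data: it uses LogisticRegression and the plain F1 at a 0.5 threshold rather than the post's run_log_reg helper with PR-curve thresholding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in: 50 training rows, 250 validation rows, 30 features.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)
X_tr, y_tr, X_va, y_va = X[:50], y[:50], X[50:], y[50:]

scores = {}
for k in range(1, X.shape[1] + 1):
    # Select the k best features on the train set only, then score on validation.
    selector = SelectKBest(k=k).fit(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
    scores[k] = f1_score(y_va, clf.predict(selector.transform(X_va)))

best_k = max(scores, key=scores.get)
print('Best k: {} (F1 = {:.3f})'.format(best_k, scores[best_k]))
```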

F1 Scores for different values of K

Approximately 45 features give the best F1 value. Let’s re-run SelectKBest with K = 45 :

selector = SelectKBest(k = 45)
train_features_selected = selector.fit_transform(train_features, y_train)
test_features_selected = selector.transform(test_features)
run_log_reg(train_features_selected, test_features_selected, y_train, y_test, alpha = 1e-2)

Output:
F1: 0.972 | Pr: 0.967 | Re: 0.978 | AUC: 0.995 | Accuracy: 0.972

Another option is to use SelectPercentile which uses the percentage of features we want to keep.


Doing the same procedure as above we get percentile = 37 for the best F1 Score. Now using SelectPercentile:

from sklearn.feature_selection import SelectPercentile

selector = SelectPercentile(percentile = 37)
train_features_selected = selector.fit_transform(train_features, y_train)
test_features_selected = selector.transform(test_features)
run_log_reg(train_features_selected, test_features_selected, y_train, y_test, alpha = 1e-2)

Output:
F1: 0.972 | Pr: 0.966 | Re: 0.979 | AUC: 0.995 | Accuracy: 0.972

Simple feature selection increased the F1 score from 0.967 (previous tuned Log Reg model) to 0.972. As mentioned earlier, this is because the lower-dimensional feature space reduces the chances of the model overfitting.

For both techniques, we can also use selector.get_support() to get a boolean mask of the selected features and use it to retrieve their names:

array(['starts_with_number', 'easy_words_ratio', 'stop_words_ratio',
'clickbait_subs_ratio', 'dale_chall_readability_score', 'glove_3',
'glove_4', 'glove_6', 'glove_10', 'glove_14', 'glove_15',
'glove_17', 'glove_19', 'glove_24', 'glove_27', 'glove_31',
'glove_32', 'glove_33', 'glove_35', 'glove_39', 'glove_41',
'glove_44', 'glove_45', 'glove_46', 'glove_49', 'glove_50',
'glove_51', 'glove_56', 'glove_57', 'glove_61', 'glove_65',
'glove_68', 'glove_72', 'glove_74', 'glove_75', 'glove_77',
'glove_80', 'glove_85', 'glove_87', 'glove_90', 'glove_92',
'glove_96', 'glove_97', 'glove_98', 'glove_99'], dtype='<U28')
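
As a small illustration of how the mask maps back to names, here is a sketch on made-up data (the data and the feature names f_a to f_d are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest

rng = np.random.RandomState(0)
y = np.array([0, 1] * 20)
X = rng.normal(size=(40, 4))
X[:, 2] += 3 * y  # make the third column strongly informative

feature_names = np.array(['f_a', 'f_b', 'f_c', 'f_d'])  # hypothetical names

selector = SelectKBest(k=1).fit(X, y)
mask = selector.get_support()         # boolean mask over the input columns
selected_names = feature_names[mask]  # index the names with the mask
print(selected_names)
```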

RFECV (Recursive Feature Elimination with Cross-Validation)

RFE is a backward feature selection technique that uses an estimator to calculate the feature importance at each stage. The word recursive in the name implies that the technique recursively removes features that are not important for classification.

We’ll use the CV variant, which uses cross-validation to determine how many features to keep. RFECV needs an estimator that exposes either a coef_ or a feature_importances_ attribute, so we'll use SGDClassifier with log loss (which provides coef_).

We also need to specify the type of cross-validation technique required. We’ll use the same PredefinedSplit that we used during hyperparameter optimization.

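from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD (newer scikit-learn versions name this loss 'log_loss')
log_reg = SGDClassifier(loss = 'log', alpha = 1e-3)
selector = RFECV(log_reg, scoring = 'f1', n_jobs = -1, cv = ps, verbose = 1)
# X is a placeholder for the feature matrix aligned with y and ps (the original call is cut off here)
selector.fit(X, y)

# Now let's select the best features and check the performance
train_features_selected = selector.transform(train_features)
test_features_selected = selector.transform(test_features)
run_log_reg(train_features_selected, test_features_selected, y_train, y_test, alpha = 1e-1)
Output: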
F1: 0.978 | Pr: 0.970 | Re: 0.986 | AUC: 0.997 | Accuracy: 0.978

Let’s check the features that were selected:

print('Number of features selected:{}'.format(selector.n_features_))
Number of features selected : 60
array(['starts_with_number', 'clickbait_phrases', 'num_dots',
'mean_word_length', 'length_in_chars', 'easy_words_ratio',
'stop_words_ratio', 'contractions_ratio', 'hyperbolic_ratio',
'clickbait_subs_ratio', 'nonclickbait_subs_ratio',
'num_punctuations', 'glove_1', 'glove_2', 'glove_4', 'glove_6',
'glove_10', 'glove_13', 'glove_14', 'glove_15', 'glove_16',
'glove_17', 'glove_21', 'glove_25', 'glove_27', 'glove_32',
'glove_33', 'glove_35', 'glove_39', 'glove_41', 'glove_43',
'glove_45', 'glove_46', 'glove_47', 'glove_50', 'glove_51',
'glove_52', 'glove_53', 'glove_54', 'glove_56', 'glove_57',
'glove_58', 'glove_61', 'glove_65', 'glove_72', 'glove_74',
'glove_77', 'glove_80', 'glove_84', 'glove_85', 'glove_86',
'glove_87', 'glove_90', 'glove_93', 'glove_94', 'glove_95',
'glove_96', 'glove_97', 'glove_98', 'glove_99'], dtype='<U28')

This time some additional features were selected, which gives a slight boost in performance. Since an estimator and a CV set are passed, the algorithm has a better way of judging which features to keep.

The other advantage here is that we did not have to specify how many features to keep; RFECV finds that out for us. However, we can set the minimum number of features we'd like to have, which is 1 by default.
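To make the moving parts concrete, here is a small self-contained sketch of RFECV with a PredefinedSplit. It runs on synthetic data and uses LogisticRegression instead of SGDClassifier, so the data, the 200/100 split and the estimator are assumptions for illustration rather than the setup used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit

# Hypothetical data: 300 rows, 20 features, 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# -1 marks rows that are only ever used for training; 0 marks the single validation fold.
test_fold = np.full(X.shape[0], -1)
test_fold[200:] = 0
ps = PredefinedSplit(test_fold)

# step=1 drops one feature per iteration; the validation F1 decides how many to keep.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=ps,
                 scoring='f1', min_features_to_select=1)
selector.fit(X, y)

print(selector.n_features_)   # number of features kept
print(selector.ranking_)      # 1 for kept features, larger values were eliminated earlier
```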

SFS (Sequential Forward Selection)

Finally, let’s try SFS, which does the same job as RFE but adds features instead of removing them. SFS starts with 0 features and greedily adds them one at a time. One small difference is that SFS relies solely on the feature set's performance on the CV set to pick the best features, unlike RFE, which ranks features by model weights (coef_ in our case).
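The greedy loop described above can be written by hand in a few lines. This is a plain forward selection sketch on synthetic data, with LogisticRegression and a single validation split as assumed stand-ins; it is not mlxtend's implementation and it leaves out the floating variant.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)


def val_f1(columns):
    """Train on the given columns and return F1 on the validation split."""
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, columns], y_train)
    return f1_score(y_val, clf.predict(X_val[:, columns]))


selected, remaining = [], list(range(X.shape[1]))
history = []  # (feature subset, validation F1) after each greedy step

while remaining:
    # Try adding each remaining feature and keep the one with the best validation F1.
    best_score, best_feature = max((val_f1(selected + [f]), f) for f in remaining)
    selected.append(best_feature)
    remaining.remove(best_feature)
    history.append((list(selected), best_score))

# Rough equivalent of k_features='best': the subset with the highest validation score.
best_subset, best_f1 = max(history, key=lambda item: item[1])
print(len(best_subset), round(best_f1, 3))
```

Each step only looks one feature ahead, which is why the method is greedy: it can miss a pair of features that are only useful together.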

# Note: mlxtend provides the SFS implementation
from mlxtend.feature_selection import SequentialFeatureSelector

log_reg = SGDClassifier(loss = 'log', alpha = 1e-2)
# k_features = 'best' returns the best subset of features
# floating = True also lets the search drop an already added feature when that improves the score
selector = SequentialFeatureSelector(log_reg, k_features = 'best', floating = True, cv = ps, scoring = 'f1', verbose = 1, n_jobs = -1)
# X is a placeholder for the feature matrix aligned with y and ps (the original call is cut off here)
selector.fit(X, y)

train_features_selected = selector.transform(train_features.tocsr())
test_features_selected = selector.transform(test_features.tocsr())
run_log_reg(train_features_selected, test_features_selected, y_train, y_test, alpha = 1e-2)
Output:
F1: 0.978 | Pr: 0.976 | Re: 0.981 | AUC: 0.997 | Accuracy: 0.978

We can also check the selected features:

print('Features selected {}'.format(len(selector.k_feature_idx_)))
Features selected : 53
array(['starts_with_number', 'clickbait_phrases','mean_word_length',
'flesch_kincaid_grade', 'dale_chall_readability_score',
'num_punctuations', 'glove_0', 'glove_1', 'glove_2', 'glove_4',
'glove_8', 'glove_10', 'glove_13', 'glove_14', 'glove_15',
'glove_16', 'glove_17', 'glove_18', 'glove_25', 'glove_30',
'glove_32', 'glove_33', 'glove_38', 'glove_39', 'glove_40',
'glove_41', 'glove_42', 'glove_45', 'glove_46', 'glove_47',
'glove_48', 'glove_51', 'glove_56', 'glove_57', 'glove_61',
'glove_65', 'glove_67', 'glove_69', 'glove_72', 'glove_73',
'glove_76', 'glove_77', 'glove_80', 'glove_81', 'glove_84',
'glove_85', 'glove_87', 'glove_93', 'glove_95', 'glove_96'],

Forward and backward selection quite often give the same results. Now let’s take a look at decomposition techniques.


Unlike feature selection which picks the best features, decomposition techniques factorize the feature matrix to reduce the dimensionality. Since these techniques change the feature space itself, one disadvantage is that we lose model/feature interpretability.

We no longer know what each dimension of the decomposed feature space represents.

Let’s try TruncatedSVD on our feature matrix. The first thing we’ll have to do is find out how the explained variance changes with the number of components.

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(train_features.shape[1] - 1)
Plot to find out n_components for TruncatedSVD

Looks like just 50 components are enough to explain 100% of the variance in the training set features. This means we have a lot of dependent features (i.e. some features are just linear combinations of other features).

This is in line with what we saw in the feature selection section — even though we have 119 features, most techniques selected between 40–70 features (the remaining features might not be important since they are merely linear combinations of other features).
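The explained-variance check can be sketched as follows. The data here is synthetic and built on purpose so that all 30 columns are linear combinations of 5 latent columns; the 99.9% threshold is also an assumption, chosen only to make the cut-off explicit.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical data: 30 columns that are all linear combinations of 5 latent columns.
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 30))

# Same idea as above: fit with (n_features - 1) components and inspect the variance curve.
svd = TruncatedSVD(n_components=X.shape[1] - 1, random_state=0)
svd.fit(X)

cumulative = np.cumsum(svd.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 99.9%.
n_needed = int(np.argmax(cumulative >= 0.999)) + 1
print(n_needed)
```

Because the redundant columns add no new directions, the cumulative curve should flatten out at roughly the number of latent columns, which is the same effect we see at 50 components in our feature matrix.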

Now we can reduce the feature matrix to 50 components.

svd = TruncatedSVD(50)
train_features_decomposed = svd.fit_transform(train_features)
test_features_decomposed = svd.transform(test_features)
run_log_reg(train_features_decomposed, test_features_decomposed, y_train, y_test, alpha = 1e-1)
Output:
F1: 0.965 | Pr: 0.955 | Re: 0.975 | AUC: 0.993 | Accuracy: 0.964

The performance is not as good as the feature selection techniques — Why?

The main job of decomposition techniques like TruncatedSVD is to explain the variance in the dataset with fewer components. In doing so, they never consider how important each feature is for predicting the target (‘clickbait’ or ‘not-clickbait’).

However, in the feature selection techniques, the feature importance or model weights are used each time a feature is removed or added. RFE and SFS in particular select features to optimize for model performance. (You might have noticed we pass ‘y’ in every fit() call in feature selection techniques.)

Stacking Classifier with Feature Selection

Finally, we can use any of the techniques above with the best performing model — Stacking Classifier. We’ll have to retune each model to the reduced feature matrix and run hyperopt again to find the best weights for the stacking classifier.

Now, after using the RFECV selected features and re-tuning:

F1: 0.980 | Pr: 0.976 | Re: 0.984 | AUC: 0.997 | Accuracy: 0.980
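One way to wire feature selection in front of an ensemble is a scikit-learn Pipeline. The sketch below is only a stand-in: it uses a soft-voting VotingClassifier with fixed weights rather than the hyperopt-tuned stacking classifier from this blog, and the three base models, the weights, the percentile and the synthetic data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical data standing in for the engineered features.
X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

ensemble = VotingClassifier(
    estimators=[('log_reg', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier()),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=0))],
    voting='soft',
    weights=[0.5, 0.2, 0.3],  # made-up weights; in practice these would be tuned
)

# The selector is fit on the training data only, then applied before every prediction.
model = make_pipeline(SelectPercentile(percentile=40), ensemble)
model.fit(X_train, y_train)

test_f1 = f1_score(y_test, model.predict(X_test))
print(round(test_f1, 3))
```

Keeping the selector inside the pipeline means any later re-tuning or cross-validation refits the selection along with the models, instead of reusing a selection made on all the data.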

10. Summary

Here’s a summary of all the models and experiments we’ve run so far:

Summary of all experiments

Let’s take a look at the Stacking Classifier’s confusion matrix:

Stacking Classifier Confusion Matrix

And here are the top 10 high-confidence misclassified titles:

Title : A Peaking Tiger Woods
Label : not-clickbait
Predicted Probability : 0.7458264596039637
Title : Stress Tests Prove a Sobering Idea
Label : not-clickbait
Predicted Probability : 0.7542456646954389
Title : Woods Returns as He Left: A Winner
Label : not-clickbait
Predicted Probability : 0.7566487248241188
Title : In Baseball, Slow Starts May Not Have Happy Endings
Label : not-clickbait
Predicted Probability : 0.7624898001334597
Title : Ainge Has Heart Attack After Celtics Say Garnett May Miss Playoffs
Label : not-clickbait
Predicted Probability : 0.7784241132465458
Title : Private Jets Lose That Feel-Good Factor
Label : not-clickbait
Predicted Probability : 0.7811035856329488
Title : A Little Rugby With Your Cross-Dressing?
Label : not-clickbait
Predicted Probability : 0.7856236669189782
Title : Smartphone From Dell? Just Maybe
Label : not-clickbait
Predicted Probability : 0.7868008600434597
Title : Cellphone Abilities That Go Untapped
Label : not-clickbait
Predicted Probability : 0.8057172770139488
Title : Darwinism Must Die So That Evolution May Live
Label : not-clickbait
Predicted Probability : 0.8305944075171504

All of the high-confidence misclassified titles are ‘not-clickbait’ and this is reflected in the confusion matrix.
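A list like the one above can be produced by filtering the wrong predictions and sorting them by confidence. Here is a minimal sketch with pandas; the DataFrame, its column names (title, label, p_clickbait) and the 0.5 threshold are assumptions for illustration, not the code used to generate the list.

```python
import pandas as pd

# Hypothetical predictions: p_clickbait is the model's predicted probability of 'clickbait'.
preds = pd.DataFrame({
    'title': ['Title A', 'Title B', 'Title C', 'Title D', 'Title E'],
    'label': ['not-clickbait', 'clickbait', 'not-clickbait', 'clickbait', 'not-clickbait'],
    'p_clickbait': [0.83, 0.91, 0.12, 0.30, 0.75],
})

# Turn probabilities into predicted labels with a 0.5 threshold.
preds['predicted'] = (preds['p_clickbait'] >= 0.5).map(
    {True: 'clickbait', False: 'not-clickbait'})
# Confidence in the predicted class, whichever class that is.
preds['confidence'] = preds['p_clickbait'].where(
    preds['predicted'] == 'clickbait', 1 - preds['p_clickbait'])

# Keep only the mistakes and show the most confident ones first.
wrong = preds[preds['predicted'] != preds['label']]
top_wrong = wrong.sort_values('confidence', ascending=False).head(10)
print(top_wrong[['title', 'label', 'confidence']])
```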

At first glance, these titles seem to be quite different from the conventional news titles. Here’s a randomly chosen sample of ‘not-clickbait’ titles from the test set:

test[test.label.values == 'not-clickbait'].sample(10).title.values
Output:
array(['Insurgents Are Said to Capture Somali Town',
'Abducted teen in Florida found',
'As Iraq Stabilizes, China Eyes Its Oil Fields',
'Paramilitary group calls for end to rioting in Northern Ireland',
'Finding Your Way Through a Maze of Smartphones',
'Thousands demand climate change action',
'Paternity Makes Punch Line of Paraguay President',
'Comcast and NFL Network Continue to Haggle',
'Constant Fear and Mob Rule in South Africa Slum',
'Sebastian Vettel wins 2010 Japanese Grand Prix'], dtype=object)

What do you think?

We can try some techniques like semi-supervised pseudo-labeling, back-translation, etc. to minimize these false positives, but in the interest of blog length, I’ll keep that for another time.

To conclude, by understanding how overfitting works in small datasets and applying techniques like feature selection, stacking, and tuning, we were able to improve performance from F1 = 0.801 to F1 = 0.98 with a mere 50 samples. Not bad!

Feel free to connect with me if you have any questions. I hope you enjoyed!

