
Modeling for Unbalanced Datasets: Tips and Strategies

When dealing with unbalanced datasets in machine learning, it's important to approach the problem carefully. Unbalanced datasets can occur when one class or group of data points dominates the other classes. This can cause issues when training machine learning models, because the model may be overly biased towards the dominant class.



Introduction

Unbalanced (or imbalanced) datasets are those in which the number of examples in one class differs significantly from the number of examples in another class. For example, in a binary classification problem one class may represent only a small percentage of the total examples. In such scenarios, the class imbalance can lead to several issues while building a machine learning model.

One of the major issues with unbalanced datasets is that the model may be biased towards the majority class, leading to poor performance in predicting the minority class. This is because the model is trained to minimize the error rate, and when the majority class is over-represented, the model tends to predict the majority class more often. This leads to a higher accuracy score, but a poor recall and precision score for the minority class.

Another issue is that the model may not generalize well when exposed to new, unseen data. This is because the model is trained on a skewed dataset and may not be able to handle the imbalance in the test data.

In this article, we will discuss various tips and strategies to handle unbalanced datasets and improve the performance of machine learning models. Some of the techniques that will be covered include resampling techniques, cost-sensitive learning, using appropriate performance metrics, ensemble methods, and other strategies. By following these guidelines, we can build effective models for unbalanced datasets.

Tips for handling unbalanced datasets

Resampling techniques are one of the most popular ways to handle unbalanced datasets. These techniques involve either reducing the number of examples in the majority class or increasing the number of examples in the minority class.

Undersampling is a technique where we randomly remove examples from the majority class to reduce its size and balance the dataset. This technique is simple and easy to implement, but it can lead to information loss as it discards some of the majority class examples.

Oversampling is the opposite of undersampling, where we randomly replicate examples from the minority class to increase its size. This technique can lead to overfitting as the model is trained on repeated examples of the minority class.

SMOTE (Synthetic Minority Over-sampling Technique) is a more advanced technique that creates synthetic examples of the minority class rather than replicating existing examples. This technique helps to balance the dataset without introducing duplicates.

Cost-sensitive learning is another technique that can be used to handle unbalanced datasets. In this approach, different misclassification costs are assigned to different classes. This means that the model is penalized more heavily for misclassifying examples from the minority class than for misclassifying examples from the majority class.
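
In scikit-learn, many classifiers expose this idea through a class_weight parameter. The following is a minimal sketch on a synthetic dataset; the weights and data are illustrative assumptions, not the credit-card data used later in this article:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Toy imbalanced data: roughly 95% majority class, 5% minority class (assumed)
X_demo, y_demo = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# class_weight='balanced' re-weights errors inversely to class frequency,
# so mistakes on the minority class are penalized more heavily
clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf.fit(X_demo, y_demo)

# Explicit misclassification costs can also be set per class,
# e.g. a 10x penalty for errors on the minority class (class 1)
clf_custom = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000, random_state=42)
clf_custom.fit(X_demo, y_demo)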

Using appropriate performance metrics is also important when working with unbalanced datasets. Accuracy is not always the best metric as it can be misleading when dealing with unbalanced datasets. Instead, using a metric such as AUC-ROC (Area Under the Receiver Operating Characteristic Curve) can provide a better indication of model performance.

Ensemble methods, such as bagging and boosting, can also be effective for modeling unbalanced datasets. These methods combine the predictions of multiple models to improve the overall performance. Bagging involves training multiple models independently and averaging their predictions, while boosting involves training multiple models sequentially, where each model attempts to correct the errors of the previous model.
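
As a rough sketch of these two ideas in scikit-learn (the estimators, toy data, and settings below are illustrative assumptions, not the models built later in this article):

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Bagging: independent trees trained on bootstrap samples, predictions combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: trees trained sequentially, each one focusing on the errors of the previous ones
boost = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(bag.score(X_te, y_te), boost.score(X_te, y_te))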

In summary, resampling techniques, cost-sensitive learning, using appropriate performance metrics, and ensemble methods are some of the tips and strategies that can help to handle unbalanced datasets and improve the performance of machine learning models.

Strategies for improving model performance on unbalanced datasets

Collecting more data is one of the most straightforward strategies for improving model performance on unbalanced datasets. By increasing the number of examples in the minority class, the model will have more information to learn from and will be less likely to be biased towards the majority class. This strategy is especially useful when the number of examples in the minority class is very small.

Generating synthetic samples is another strategy that can be used to improve model performance. Synthetic samples are artificially created examples that are similar to the real examples in the minority class. These samples can be generated using techniques such as SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic examples by interpolating between existing examples. Generating synthetic samples can help to balance the dataset and provide more examples for the model to learn from.

Using domain knowledge to focus on important samples is a strategy that can be used to improve model performance by identifying the most informative examples in the dataset. For example, if we are working on a medical dataset, we may know that certain symptoms or lab results are more indicative of a certain disease. By focusing on these examples, we can improve the model's ability to accurately predict the minority class.

Finally, advanced techniques such as anomaly detection can be used to identify and focus on the minority class examples. These techniques can be used to identify examples that are different from the majority class and are likely to be minority class examples. This can help to improve model performance by identifying the most informative examples in the dataset.
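
One way to sketch this idea is to treat the minority class as anomalies and fit an unsupervised detector such as scikit-learn's IsolationForest. The toy setup below is an assumption for illustration, not a method applied later in this article:

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_classification

# Toy data where the minority class makes up roughly 1% of the samples (assumed)
X_demo, y_demo = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)

# contamination should roughly match the expected anomaly (minority) fraction
iso = IsolationForest(contamination=0.01, random_state=0).fit(X_demo)

# predict() returns -1 for points flagged as anomalies and 1 for normal points
flags = iso.predict(X_demo)
print("Flagged as anomalies:", (flags == -1).sum())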

In summary, collecting more data, generating synthetic samples, using domain knowledge to focus on important samples and using advanced techniques such as anomaly detection are some of the strategies that can be used to improve model performance on unbalanced datasets. These strategies can help to balance the dataset, provide more examples for the model to learn from, and identify the most informative examples in the dataset.


Let's practice this in Python.

Import Libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc, rcParams
import itertools
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

Read data

# Reading the dataset
df = pd.read_csv("creditcard.csv")

# Number of variables and observations in the data set
print("Number of observations : ", len(df))
print("Number of variables : ", len(df.columns))

Number of observations : 284807
Number of variables : 31

# Observe the types of variables in the dataset and whether they contain nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

f, ax = plt.subplots(1, 2, figsize=(18, 6))
df['Class'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('distribution')
ax[0].set_ylabel('')
sns.countplot(x='Class', data=df, ax=ax[1])
ax[1].set_title('Class')
plt.show()

print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100, 2), '% of the dataset')

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset

# Standardizing Time and Amount variables
rob_scaler = RobustScaler()
df['Amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['Time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1, 1))
df.head()

# We apply the hold-out method and divide the data set into training and testing (80%, 20%).
X = df.drop("Class", axis=1)
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)

# Defining and training the model and checking its accuracy score
model = LogisticRegression(random_state=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.3f" % (accuracy))

Accuracy: 0.999

Accuracy is the ratio of correct predictions to all predictions in the system. The accuracy score of the model we created is 0.999. We can say that our model works perfectly, right?

Let's take a look at the Confusion Matrix to examine its performance.

Confusion Matrix

The confusion matrix is a table used to describe the performance of a classification model by comparing its predictions against the true values in the test data. It contains 4 different combinations of predicted and actual values.

Terminology and derivations from a confusion matrix

  • condition positive (P): the number of real positive cases in the data
  • condition negative (N): the number of real negative cases in the data
  • true positive (TP): A test result that correctly indicates the presence of a condition or characteristic
  • true negative (TN): A test result that correctly indicates the absence of a condition or characteristic
  • false positive (FP): A test result which wrongly indicates that a particular condition or attribute is present
  • false negative (FN): A test result which wrongly indicates that a particular condition or attribute is absent

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.viridis):
    plt.rcParams.update({'font.size': 20})
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontdict={'size': '18'})
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, fontsize=12, color="blue")
    plt.yticks(tick_marks, classes, fontsize=12, color="blue")
    rc('font', weight='bold')
    fmt = '.1f'
    thresh = cm.max()
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="red")
    plt.ylabel('True label', fontdict={'size': '12'})
    plt.xlabel('Predicted label', fontdict={'size': '12'})
    plt.tight_layout()

plot_confusion_matrix(confusion_matrix(y_test, y_pred=y_pred), classes=['Non Fraud', 'Fraud'],
                      title='Confusion matrix')

• Of the 56875 actual Non-Fraud transactions in the test set, 56870 were predicted correctly (true negatives) and 5 were misclassified as Fraud (false positives).

• Of the 87 actual Fraud transactions, 56 were predicted correctly (true positives) and 31 were missed (false negatives).

The model reports that it can predict with 0.999 accuracy, but when we examine the confusion matrix, the rate of false predictions for the Fraud class is quite high. The model is good at predicting the majority class but poor at predicting the minority class: it predicts the Non-Fraud class correctly at a rate of about 0.999, and this success is driven by the fact that Non-Fraud observations vastly outnumber Fraud observations. From this we can say that the accuracy score is not a good performance measure for classification models, especially on an imbalanced dataset like ours. Having examined the dataset, we can now look at how to deal with the imbalance, which methods can be applied, and which metrics we can use to measure performance.

Choosing the Right Metric

When dealing with imbalanced datasets, it is important to choose the right metric to evaluate the performance of the model. Overall accuracy on its own is usually not appropriate, because it does not take the class distribution into account; metrics built on per-class precision and recall give a much clearer picture.

One metric that is often used for imbalanced datasets is the F1 score. The F1 score is the harmonic mean of precision and recall, and it provides a balance between the two metrics. It is calculated as follows:

F1 = 2 * (precision * recall) / (precision + recall)
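
For example, with the fraud-class precision and recall computed later in this article (about 0.92 and 0.64), F1 = 2 * (0.92 * 0.64) / (0.92 + 0.64) ≈ 0.75.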

Another metric that is often used for imbalanced datasets is the area under the receiver operating characteristic curve (AUC-ROC). The AUC-ROC is a measure of the model's ability to distinguish between positive and negative classes. It is calculated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. AUC-ROC values range from 0.5 (random guessing) to 1.0 (perfect classification).

#classification report

print(classification_report(y_test, y_pred))

Let's examine the Precision measure for each class.

It shows how many of the predictions made for the 0 (non-fraud) class are correct. Looking at the confusion matrix, 56870 + 31 = 56901 non-fraud predictions were made and 56870 of them were correct. The precision value for class 0 is therefore 56870 / 56901 ≈ 1.

It shows how many of the predictions made for the 1 (fraud) class are correct. Looking at the confusion matrix, 5 + 56 = 61 fraud predictions were made and 56 of them were correct. The precision value for class 1 is therefore 56 / 61 ≈ 0.92.

Let's examine the Recall measure for each class.

It shows how many of the actual 0 (non-fraud) observations are predicted correctly. We have 56870 + 5 = 56875 observations belonging to the non-fraud class and 56870 of them were predicted correctly. The recall value for class 0 is 56870 / 56875 ≈ 1.

It shows how many of the actual 1 (fraud) observations are predicted correctly. We have 31 + 56 = 87 observations belonging to the fraud class and 56 of them were predicted correctly. The recall value for class 1 is 56 / 87 ≈ 0.64.

Looking at the recall values, we can easily see how poorly class 1 is predicted. The F1-score, in turn, expresses the harmonic mean of the recall and precision values.

Support refers to the number of actual occurrences of each class in the test set. It exposes the structural weakness of these measurements: the imbalance in the number of observations between the classes strongly affects how the scores should be read.

ROC curve

A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance of a binary classification model at all classification thresholds. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The true positive rate is the proportion of positive samples that are correctly classified as positive. It is calculated as the number of true positives divided by the number of actual positives. The false positive rate is the proportion of negative samples that are incorrectly classified as positive. It is calculated as the number of false positives divided by the number of actual negatives.

AUC (Area under the ROC curve)

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

# Auc Roc Curve
def generate_auc_roc_curve(clf, X_test):
    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    auc = roc_auc_score(y_test, y_pred_proba)
    plt.plot(fpr, tpr)
    plt.show()

generate_auc_roc_curve(model, X_test)

y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print("AUC ROC Curve with Area Under the curve = %.3f" % auc)

AUC ROC Curve with Area Under the curve = 0.985

NOTE: Resampling methods should be applied to the training set only. If they are applied to the test set, the evaluation will no longer reflect performance on real, imbalanced data.

Resampling

Oversampling

Balances the dataset by adding minority-class samples, either by duplicating existing ones or by generating synthetic ones.

Random Oversampling:

This technique balances the data set by adding randomly selected (duplicated) samples from the minority class. It can be used if your dataset is small, but it may cause overfitting. The RandomOverSampler method takes a sampling_strategy argument; when sampling_strategy='minority' is passed, it increases the number of minority-class samples until it equals the number of majority-class samples.

We can also pass a float value to this argument. For example, suppose the majority class has 1000 samples and the minority class has 100. With sampling_strategy=0.5, minority samples are added until the minority class reaches 500, i.e. half the size of the majority class, as the sketch below illustrates.
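
Here is a minimal sketch of the float form of sampling_strategy on synthetic data; the make_classification setup is an assumed toy example, not the credit-card data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy dataset with roughly 1000 majority and 100 minority samples (assumed)
X_toy, y_toy = make_classification(n_samples=1100, weights=[0.91, 0.09], random_state=0)
print(Counter(y_toy))

# Oversample the minority class up to 50% of the majority class size
ros = RandomOverSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = ros.fit_resample(X_toy, y_toy)
print(Counter(y_res))  # the minority count is now about half the majority count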

# Number of classes in the training set before random oversampling
y_train.value_counts()

0    227437
1       408
Name: Class, dtype: int64

# Implementing random oversampling (applying to the training set)
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
X_randomover, y_randomover = oversample.fit_resample(X_train, y_train)

# Number of classes of the training set after random oversampling
y_randomover.value_counts()

0    227437
1    227437
Name: Class, dtype: int64

# training the model and its success rate
model.fit(X_randomover, y_randomover)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.3f" % (accuracy))

Accuracy: 0.977

plot_confusion_matrix(confusion_matrix(y_test, y_pred=y_pred), classes=['Non Fraud', 'Fraud'],
                      title='Confusion matrix')

#classification report
print(classification_report(y_test, y_pred))

After applying random oversampling, the accuracy of the trained model drops to about 0.977. Looking at the confusion matrix and classification report, many of the predicted fraud cases are false alarms, which lowers the precision for class 1. However, the recall for class 1 increases: the model now correctly identifies a much larger share of the fraud cases. Compared with the first model, the prediction success for the Non-Fraud class has decreased slightly, but the improvement in detecting the Fraud class is a strong reason to prefer the model built after random oversampling.

SMOTE Oversampling:

SMOTE generates synthetic minority-class samples instead of duplicating existing ones, which reduces the risk of overfitting.

First, a random sample from the minority class is selected. Then its k nearest neighbors are found. One of the k nearest neighbors is chosen at random, and a synthetic sample is created at a randomly chosen point on the line segment connecting the two samples in feature space.

# Number of classes in the training set before SMOTE
y_train.value_counts()

0    227437
1       408
Name: Class, dtype: int64

# Applying SMOTE (applying to the training set)
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X_smote, y_smote = oversample.fit_resample(X_train, y_train)

# Number of classes of the training set after SMOTE
y_smote.value_counts()

0    227437
1    227437
Name: Class, dtype: int64

# training the model and its success rate
model.fit(X_smote, y_smote)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.3f" % (accuracy))

Accuracy: 0.975

plot_confusion_matrix(confusion_matrix(y_test, y_pred=y_pred), classes=['Non Fraud', 'Fraud'],
                      title='Confusion matrix')

from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt

conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion matrix:\n', conf_mat)

labels = ['Class 0', 'Class 1']
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(conf_mat, cmap=plt.cm.Blues)
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('Expected')
plt.show()

Confusion matrix:
 [[55466  1412]
 [    8    76]]

#classification report
print(classification_report(y_test, y_pred))

Undersampling

This technique balances the data set by removing samples belonging to the majority class. It can be used when you have a large amount of data. Because the minority class in our training set contains only a few hundred samples, undersampling discards almost all of the majority-class data and will not give efficient results here. Nevertheless, I will explain the methods and show how some of them can be applied.

Random Undersampling:

  • The samples to be removed are selected at random.
  • You can use this technique if you have a large data set.
  • Information may be lost due to random selection.

# Number of classes in the training set before random undersampling
y_train.value_counts()

0    227437
1       408
Name: Class, dtype: int64

from imblearn.under_sampling import RandomUnderSampler

# transform the dataset
ranUnSample = RandomUnderSampler()
X_ranUnSample, y_ranUnSample = ranUnSample.fit_resample(X_train, y_train)

# After random undersampling
y_ranUnSample.value_counts()

0    408
1    408
Name: Class, dtype: int64
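
The post stops at the resampling step here. As a minimal follow-up sketch (reusing the variables defined above), the undersampled training set could be fed to the same logistic regression model and evaluated on the untouched, still-imbalanced test set:

# Train on the undersampled training set, evaluate on the original test set
model_under = LogisticRegression(random_state=2)
model_under.fit(X_ranUnSample, y_ranUnSample)

y_pred_under = model_under.predict(X_test)
print(classification_report(y_test, y_pred_under))
# Expect higher recall for the Fraud class but lower precision,
# since the model now sees far fewer Non-Fraud examples during training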

Change the algorithm

from sklearn.ensemble import RandomForestClassifier

# train model
rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

accuracy_score(y_test, rfc_pred)
0.9996313331694814

# f1 score
f1_score(y_test, rfc_pred)
0.8645161290322582

# confusion matrix
pd.DataFrame(confusion_matrix(y_test, rfc_pred))

# recall score
recall_score(y_test, rfc_pred)
0.7976190476190477

Conclusion

In this article, we discussed various tips and strategies for handling unbalanced datasets and improving the performance of machine learning models. Unbalanced datasets can be a common problem in machine learning and can lead to poor performance in predicting the minority class.

We discussed various resampling techniques such as undersampling, oversampling, and SMOTE that can be used to balance the dataset. We also discussed cost-sensitive learning and using appropriate performance metrics such as AUC-ROC, which can provide a better indication of model performance. Additionally, we discussed ensemble methods such as bagging and boosting that can also be effective for modeling unbalanced datasets.

We also discussed strategies for improving model performance on unbalanced datasets, such as collecting more data, generating synthetic samples, using domain knowledge to focus on important samples, and using advanced techniques such as anomaly detection.

In summary, unbalanced datasets can be challenging to work with, but by following the tips and strategies discussed in this article, we can build effective models that can accurately predict the minority class. It's important to remember that the best approach will depend on the specific dataset and problem, and that a combination of techniques may be necessary to achieve the best results. Therefore, it's important to experiment with different techniques and evaluate their performance using appropriate metrics.

References

  1. https://imbalanced-learn.org/stable/introduction.html
  2. https://courses.miuul.com/p/machine-learning
  3. https://imbalanced-learn.org/dev/references/over_sampling.html
