
Unconventional Sentiment Analysis: BERT vs. CatBoost


Taras Baranyuk

3 years ago | 6 min read

Sentiment analysis is fundamental, as it helps to understand the emotional tones within language. This, in turn, helps to automatically sort the opinions behind reviews, social media discussions, etc., allowing you to make faster, more accurate decisions.

Although sentiment analysis has become extremely popular in recent times, work on it has been progressing since the early 2000s.

Traditional machine learning methods such as Naive Bayesian, Logistic Regression, and Support Vector Machines (SVMs) are widely used for large-scale sentiment analysis because they scale well.

Deep learning (DL) techniques have now been proven to provide better accuracy for various NLP tasks, including sentiment analysis; however, they tend to be slower and more expensive to train and run.

In this story, I want to offer a little-known alternative that combines speed and quality. To draw conclusions about the proposed method, I need a baseline model; I chose the time-tested and popular BERT.

Getting the Data

Social media is a source that produces a massive amount of data on an unprecedented scale. The dataset I will be using for this story is the Coronavirus tweets NLP dataset.
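The loading step is not shown in the post; here is a minimal sketch, assuming the usual Kaggle layout of this dataset (the file names and encoding are assumptions, so adjust them to your local copy):

import pandas as pd

# Assumed file names and encoding for the Kaggle "Coronavirus tweets NLP" dataset.
df_train = pd.read_csv('Corona_NLP_train.csv', encoding='latin-1')
df_test = pd.read_csv('Corona_NLP_test.csv', encoding='latin-1')
print(df_train.shape)
print(df_train['Sentiment'].value_counts())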

There is not much data for the model, and at first glance, it seems that one cannot do without a pre-trained model.

Due to the small number of samples for training, I reduce the number of classes to 3 by combining them.
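The exact merge is not spelled out in the post; a natural choice for this dataset is to fold the "Extremely" classes into their plain counterparts, roughly like this:

# Assumed mapping from the original 5 classes down to 3.
sentiment_map = {
    'Extremely Negative': 'Negative',
    'Negative': 'Negative',
    'Neutral': 'Neutral',
    'Positive': 'Positive',
    'Extremely Positive': 'Positive',
}
df_train['Sentiment'] = df_train['Sentiment'].map(sentiment_map)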

Baseline BERT Model

Let’s use TensorFlow Hub. TensorFlow Hub is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. You can use trained models like BERT and Faster R-CNN with just a few lines of code.

!pip install tensorflow_hub
!pip install tensorflow_text
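For reference, the snippets in this section rely on the following imports (collected here for readability; the original notebook defines them elsewhere):

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops needed by the BERT preprocessing model
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split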

small_bert/bert_en_uncased_L-4_H-512_A-8 — Smaller BERT model.

This is one of the smaller BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.

The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where a larger and more accurate teacher produces the fine-tuning labels.

bert_en_uncased_preprocess — Text preprocessing for BERT. This model uses a vocabulary for English extracted from Wikipedia and BooksCorpus. Text inputs have been normalized the “uncased” way, meaning that the text has been lower-cased before tokenization into word pieces, and any accent markers have been stripped.

tfhub_handle_encoder = \
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"
tfhub_handle_preprocess = \
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

I will skip parameter selection and optimization to keep the code simple. After all, this is the baseline model, not SOTA.

def build_classifier_model():
    # Raw tweet strings go in; the TF Hub preprocessing layer handles tokenization.
    text_input = tf.keras.layers.Input(
        shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)

    # trainable=True fine-tunes the BERT weights along with the classifier head.
    encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)

    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(
        3, activation='softmax', name='classifier')(net)
    model = tf.keras.Model(text_input, net)

    # The classifier already applies softmax, so the loss must consume
    # probabilities, not logits (from_logits=False).
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    metric = tf.metrics.CategoricalAccuracy('accuracy')
    optimizer = Adam(
        learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
    model.compile(optimizer=optimizer, loss=loss, metrics=metric)
    model.summary()
    return model
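The instantiation itself is implicit in the post; the model used below is created like this:

classifier_model = build_classifier_model()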

I have created a model with just under 30M parameters.

I allocated 30 percent of the train data for model validation.

train, valid = train_test_split(
    df_train,
    train_size=0.7,
    random_state=0,
    stratify=df_train['Sentiment'])
y_train, X_train = \
    train['Sentiment'], train.drop(['Sentiment'], axis=1)
y_valid, X_valid = \
    valid['Sentiment'], valid.drop(['Sentiment'], axis=1)

# One-hot targets for the 3 classes (category codes are assigned alphabetically).
y_train_c = tf.keras.utils.to_categorical(
    y_train.astype('category').cat.codes.values, num_classes=3)
y_valid_c = tf.keras.utils.to_categorical(
    y_valid.astype('category').cat.codes.values, num_classes=3)

The number of epochs was chosen intuitively and did not require justification :)

history = classifier_model.fit(
    x=X_train['Tweet'].values,
    y=y_train_c,
    validation_data=(X_valid['Tweet'].values, y_valid_c),
    epochs=5)

BERT accuracy: 0.833859920501709

(Confusion matrix and classification report for the BERT model appear as images in the original post.)
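The post shows these only as images; a plausible way to compute them (the variable names here are assumptions) is:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predicted class = argmax over the softmax probabilities; true labels use
# the same alphabetical category codes as in training.
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)
y_pred_bert = np.argmax(y_proba_bert, axis=1)
y_true = y_valid.astype('category').cat.codes.values
print(accuracy_score(y_true, y_pred_bert))
print(confusion_matrix(y_true, y_pred_bert))
print(classification_report(y_true, y_pred_bert))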

Now I have the baseline model. Obviously, it can be improved further, but let's leave that task as your homework.

CatBoost Model

CatBoost is a high-performance, open-source library for gradient boosting on decision trees. From release 0.19.1, it supports text features for classification on GPU out-of-the-box.

The main advantage is that CatBoost can handle categorical features and text features in your data without additional preprocessing. For those who value inference speed: CatBoost predictions are 20 to 40 times faster than those of other open-source gradient boosting libraries, which makes CatBoost useful for latency-critical tasks.

!pip install catboost
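The CatBoost snippets below assume these imports:

from catboost import CatBoostClassifier, Pool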

I will not select the optimal parameters; let that be your other homework. Let’s write a function to initialize and train the model.

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',  # overfitting detector: stop after od_wait
        od_wait=500,     # iterations without metric improvement
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)

When working with CatBoost, I recommend using a Pool. A Pool is a convenience wrapper combining features, labels, and other metadata such as categorical and text features.

train_pool = Pool(
    data=X_train,
    label=y_train,
    text_features=['Tweet']
)
valid_pool = Pool(
    data=X_valid,
    label=y_valid,
    text_features=['Tweet']
)

text_features — A one-dimensional array of text column indices (specified as integers) or names (specified as strings). Use it only if the data parameter is a two-dimensional feature matrix (one of the following types: list, numpy.ndarray, pandas.DataFrame, pandas.Series).

If any elements in this array are specified as names instead of indices, names for all columns must be provided. To do this, either use the feature_names parameter of this constructor to explicitly specify them or pass a pandas.DataFrame with column names specified in the data parameter.
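A small illustration (not from the post): with a pandas.DataFrame, the same text column can be referenced either way.

# Equivalent declarations of the 'Tweet' text column, by name and by index.
pool_by_name = Pool(data=X_train, label=y_train, text_features=['Tweet'])
pool_by_index = Pool(data=X_train, label=y_train,
                     text_features=[X_train.columns.get_loc('Tweet')])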

Supported training parameters

  1. tokenizers — Tokenizers are used to preprocess Text type feature columns before creating the dictionary.
  2. dictionaries — Dictionaries used to preprocess Text type feature columns.
  3. feature_calcers — Feature calcers used to calculate new features based on preprocessed Text type feature columns.

I set all the parameters intuitively; tuning them will be your homework again.

model = fit_model(
    train_pool, valid_pool,
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types': ['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy': 'SeveralTokens'
        }
    ],
    dictionaries=[
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '50000'
        }
    ],
    feature_calcers=[
        'BoW:top_tokens_count=10000'
    ]
)

(Accuracy and loss curves during training appear as plots in the original post.)

CatBoost model accuracy: 0.8299104791995787

(Confusion matrix and classification report for the CatBoost model appear as images in the original post.)

The result is very close to what the baseline BERT model showed. Because I have very little data for training, and the model was trained from scratch, the result is, in my opinion, impressive.

Bonus

I got two models with very similar results. Can this give us anything else useful? The two models have little in common at their core, which suggests that combining them should give a synergistic effect. The easiest way to test this is to average their predicted probabilities and see what happens.
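The post does not show how the two probability matrices are obtained; a plausible sketch, reusing the validation data from above (and assuming both models order the three classes alphabetically, so the columns line up):

y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)  # shape (n, 3)
y_proba_cb = model.predict_proba(valid_pool)                      # shape (n, 3)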

y_proba_avg = np.argmax((y_proba_cb + y_proba_bert)/2, axis=1)

The gain is impressive.

Average accuracy: 0.855713533438652

(Confusion matrix and classification report for the averaged predictions appear as images in the original post.)

Summary

In this story, I:

  1. created a baseline model using BERT;
  2. created a model with CatBoost using built-in text capabilities;
  3. looked at what happens when the results from both models are averaged.

In my opinion, complex and slow SOTA models can be avoided in most cases, especially if speed is a critical requirement.

CatBoost provides great sentiment analysis capabilities right out of the box. For lovers of competitions such as Kaggle, DrivenData, etc., CatBoost can provide a good model, both as a baseline solution and as part of an ensemble of models.

The code from the story can be viewed here.
