Text is everywhere, and social media is one of its biggest generators. People constantly share it across many platforms. Rather than leaving it as it is, we can process it into something useful using text mining methods.

One famous application is sentiment analysis, where we identify whether a text’s opinion is positive, negative, or neutral. Here, though, we’ll talk about another method and how to make sense of it: text clustering.

As part of unsupervised learning, clustering groups similar data points without knowing in advance which cluster each point belongs to. So in a sense, text clustering is about grouping similar texts (or sentences) together. But how exactly do we decide whether some texts are similar? How can we tell the machine that the word “tree” is similar to “plant”?

Think of unsupervised learning as a sort of mathematical version of making “birds of a feather flock together.” —
Cassie Kozyrkov

This may feel overwhelming if you’re new to text data processing, but stay with me. I won’t go into many complex details and will just cover the important points for easy understanding.

You can skip the code sections if you’re not a code person. For those who are interested, the full code is available on GitHub.

Without further ado, here we go!

Let’s get to know the data

We’ll be using open-source data that can be downloaded from Kaggle. Thanks to Dody Agung for creating this Traffic Accident in Indonesia dataset.

The full dataset contains more than 150,000 tweets in Bahasa Indonesia that include the keyword “kecelakaan” (meaning “accident”). It contains the tweet ID, the time it was tweeted, the time it was crawled, the username of the account that posted it, and the full tweet.

The creator also provides a manually labeled dataset containing 1,000 tweets, each with a flag indicating whether the tweet reports a real accident: 1 for accident and 0 for non-accident. We’ll attempt text clustering using this labeled dataset. Here’s a look at the data.

import pandas as pd

data = pd.read_csv('twitter_label_manual.csv')
data.head()

Text Pre-processing

As mentioned above, how can we determine if some texts are similar? Computers only calculate numbers, so we translate our texts into numbers!

Before we get into that, we need to clean our text. Let’s see why, using this sentence as our example:

Driving SAFELY on road is a MUST for each one of us whether driving a 🚌 Bus, 🚗 Car, 🚛 Truck or a 🛵 Two Wheeler..!! 😱😱😱

Filtering & Case Folding

Emojis aren’t text, and neither are symbols and special characters such as “.”, “!”, “~”, etc. We’ll filter those out so the data is pure text.

Case folding is also needed because there may be tweets containing “driving”, “DRIVING”, or “dRiVinG”. We’ll simply lowercase all the text so every word has the same format.

After we apply filtering and case folding, the sentence will look cleaner:

driving safely on road is a must for each one of us whether driving a bus car truck or a two wheeler
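
To make these two steps concrete, here is a minimal sketch in plain Python. The helper name clean_text is just an illustration, and unlike the pipeline later in this article it replaces unwanted characters with a space (rather than nothing) so that neighbouring words don’t get glued together.

```python
import re

def clean_text(text):
    # Filtering: keep only ASCII letters and spaces, so emojis,
    # digits and punctuation are replaced by a space.
    letters_only = re.sub(r"[^a-zA-Z ]+", " ", text)
    # Case folding, then collapse any repeated spaces left behind.
    return " ".join(letters_only.lower().split())

raw = ("Driving SAFELY on road is a MUST for each one of us whether "
       "driving a 🚌 Bus, 🚗 Car, 🚛 Truck or a 🛵 Two Wheeler..!! 😱😱😱")
cleaned = clean_text(raw)
print(cleaned)
```

Running it on the example sentence should print the cleaned sentence shown above.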

Stemming & Stopword Removal

Stemming (or, alternatively, lemmatization) is the process of reducing a word to its base form. The point is that words such as “drives”, “driving”, and “driven” share the same meaning, so we use the base form “drive”. See the differences between stemming and lemmatization, here.

Another step is the removal of stopwords like “is”, “and”, “or”, etc. Stopwords are the most common words in a language. They appear so often that they carry little real information, so we remove them completely.

After we apply stemming and remove the stopwords, the sentence will look like this:

drive safe road must each one us whether drive bus car truck two wheeler
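
Here is a toy sketch of the same idea in plain Python. The tiny STOPWORDS set and the BASE_FORMS lookup table are made up for this example only; a real stemmer applies language-specific rules instead of a hand-written dictionary.

```python
# Hypothetical, hand-picked stopwords and base forms (illustration only).
STOPWORDS = {"on", "is", "a", "for", "of", "or"}
BASE_FORMS = {"driving": "drive", "drives": "drive",
              "driven": "drive", "safely": "safe"}

def stem_and_filter(text):
    tokens = text.split()
    # Stopword removal: drop very common words that carry little meaning.
    kept = [t for t in tokens if t not in STOPWORDS]
    # "Stemming": map each remaining word to its base form if we know one.
    return " ".join(BASE_FORMS.get(t, t) for t in kept)

cleaned = ("driving safely on road is a must for each one of us "
           "whether driving a bus car truck or a two wheeler")
result = stem_and_filter(cleaned)
print(result)
```

The lookup table keeps the example dependency-free, at the cost of only knowing the handful of words we listed.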

NLTK (Natural Language Toolkit) is the usual go-to library for this kind of preprocessing, but since our tweets are in Bahasa Indonesia, we use Sastrawi instead.

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopwords = StopWordRemoverFactory().get_stop_words()

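def text_preprocess(series, stemmer, stopwords):
    stopwords = set(stopwords)  # set lookup is faster than scanning a list
    # Replace newlines and tabs with a space so words don't get glued together
    df = series.str.replace(r"[\n\t]", " ", regex=True)
    # Keep only letters and spaces (regex=True must be explicit on pandas >= 2.0)
    df = df.str.replace(r"[^a-zA-Z ]+", "", regex=True)
    df = df.str.lower()
    # Drop stopwords, then stem the remaining tokens
    df = df.apply(lambda x: ' '.join([stemmer.stem(item) for item in x.split() if item not in stopwords]))
    return df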

data['processed_text'] = text_preprocess(data['full_text'], stemmer, stopwords)

Word Embedding

This part is the key to an intuitive understanding of how text clustering works. We can easily find the sum, average, and count of a set of numbers, but what about text? In this part, we’re converting text into numbers.

There are count-based methods like CountVectorizer and TF-IDF, but we’ll specifically use word embeddings in this experiment. Basically, a word embedding represents words as vectors in a space where similar words are mapped near each other.

Here’s an example of word vector representations in 3-dimensional space.
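
To get a feel for what “near each other” means, here is a small sketch using cosine similarity, one common way to compare vectors. The three 3-dimensional vectors are invented for illustration; real embeddings are learned from data.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model).
vectors = {
    "tree":  [0.9, 0.8, 0.1],
    "plant": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

sim_tree_plant = cosine_similarity(vectors["tree"], vectors["plant"])
sim_tree_car = cosine_similarity(vectors["tree"], vectors["car"])
print(round(sim_tree_plant, 3), round(sim_tree_car, 3))
```

With these made-up numbers, “tree” ends up much closer to “plant” than to “car”, which is the kind of behaviour we hope a trained embedding gives us.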

To apply word embedding to our dataset, we’ll use the fastText library. It provides a pre-trained model for the Indonesian language, but instead we’ll train our own word embedding model using the 150,000+ available tweets as our corpus. I’ve processed the text beforehand and saved it in twitter.txt.

import fasttext

model = fasttext.train_unsupervised('twitter.txt')
data['vec'] = data['processed_text'].apply(lambda x: model.get_sentence_vector(x))

By default, fastText’s train_unsupervised will use the skipgram model and output 100-dimensional vectors. These vectors represent where a tweet is placed within 100 dimensions.

You may have noticed that we didn’t tokenize the sentences. That’s because get_sentence_vector tokenizes the text automatically (splits it into pieces). For more details, the model can be learned from here.

FastText also computes similarity scores between words. Using get_nearest_neighbors, we can see the 10 words most similar to a given word, along with their similarity scores. The closer the score is to 1, the more similar the word is to the given word.
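
As far as I understand it, for an unsupervised model the sentence vector is essentially an average of the (length-normalized) word vectors in the sentence. Here is a rough sketch of that idea in plain Python. The toy_vectors dictionary and the sentence_vector helper are made up for illustration, and unlike fastText this toy version simply skips unknown words instead of building them from subwords.

```python
import math

def sentence_vector(text, word_vectors):
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for token in text.split():          # whitespace tokenization
        vec = word_vectors.get(token)
        if vec is None:                 # toy behaviour: ignore unknown words
            continue
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            # Scale each word vector to unit length before summing.
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    # Average over the words that contributed.
    return [t / count for t in total] if count else total

# Hypothetical 2-dimensional word vectors.
toy_vectors = {"mobil": [3.0, 4.0], "tabrak": [0.0, 2.0]}
vec = sentence_vector("mobil tabrak pohon", toy_vectors)
print(vec)
```

The takeaway is that a whole tweet ends up as a single point with the same number of dimensions as the word vectors.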

Here’s the demonstration from fastText’s website.

model.get_nearest_neighbors('accomodation')

[(0.96342, 'accomodations'), (0.942124, 'accommodation'), (0.915427, 'accommodations'), (0.847751, 'accommodative'), (0.794353, 'accommodating'), (0.740381, 'accomodated'), (0.729746, 'amenities'), (0.725975, 'catering'), (0.703177, 'accomodate'), (0.701426, 'hospitality')]

Their pre-trained model even knows that the context of accomodation can be catering and hospitality. It’s even more impressive once you notice that the input is a misspelling of accommodation (yes, it can handle typos).

How about our model? Let’s check it out.

# Motorcycle in Bahasa Indonesia
model.get_nearest_neighbors('motor')

[(0.776196300983429, ‘sepedamotor’), (0.7229066491127014, ‘motor’), (0.7132794260978699, ‘sepeda’), (0.698093056678772, ‘motore’), (0.6889493465423584, ‘motor’), (0.6859809160232544, ‘motorplus’), (0.6410694718360901, ‘nmax’), (0.6405101418495178, ‘bonceng’), (0.6383005976676941, ‘motortabrakan’), (0.6309819221496582, ‘matic’)]

# Car in Bahasa Indonesia
model.get_nearest_neighbors('mobil')

[(0.7426463961601257, ‘ringsek’), (0.7367433905601501, ‘tabrak’), (0.7266382575035095, ‘mobil’), (0.7141972780227661, ‘mobilwow”’), (0.7097604274749756, ‘ringsek’), (0.706925094127655, ‘mobilio’), (0.706623375415802, ‘pajero’), (0.705599844455719, ‘fortuner’), (0.7012485265731812, ‘“pajero’), (0.6936420798301697, ‘mpv’)]

Looking good! If you’re Indonesian, you’ll recognize right away that these words belong to the same context. Some of them, like mobilio, pajero, fortuner, and mpv, are actually well-known car models in Indonesia.

Once the training is done, the model is used to convert each tweet into a 100-dimensional vector. Here’s an example of a vectorized tweet:

[-0.03824997, 0.00133674, -0.0975338 , 0.07422361, 0.04062992, 0.15320793, 0.0624048 , 0.08707056, -0.04479782, 0.01363136, 0.17272875, -0.03097608, 0.05366326, -0.09492738, 0.06163749, 0.04166117, -0.0779877 , 0.11031814, 0.04414257, -0.04424104, 0.02991617, -0.02359444, 0.08660134, -0.01918944, -0.02529236, -0.06084985, 0.00374846, 0.07403581, 0.03064661, 0.0105409 , 0.02821296, -0.08867718, -0.00845077, -0.04583884, -0.03845499, -0.04432626, 0.08085568, 0.0762938 , -0.03690336, 0.00286471, 0.05640269, 0.08347917, -0.12400634, 0.06856565, 0.09385975, 0.07298957, -0.03306708, 0.07894476, -0.03820109, -0.05187325, -0.08153208, -0.05167899, -0.07915987, 0.05901144, 0.00445149, -0.14628977, 0.04536996, 0.12275991, 0.14212511, -0.04074997, 0.04834579, 0.1293375 , 0.13116567, 0.10201992, -0.1010689 , -0.01407889, -0.01707099, 0.13866977, 0.03039356, 0.08307764, 0.06886553, 0.08681376, 0.02241692, -0.0974027 , -0.02969944, -0.06031594, 0.07977851, 0.09534364, -0.0803275 , -0.18087131, 0.00296218, 0.06247464, -0.00784681, -0.0209177 , 0.10568991, -0.06968653, -0.07200669, 0.06571897, 0.01448524, 0.15396708, 0.00435031, 0.02272239, 0.05981111, -0.03069473, -0.11629239, -0.11808605, -0.01497007, -0.00028591, 0.02116462, -0.11837215]

That’s a bunch of numbers, isn’t it? Don’t try to visualize 100 dimensions in your head, okay? Humans can’t (at least for now?), but our computers can handle that just fine! Now that we have created the word vectors, how can we cluster similar tweets together?

Text Clustering

As a refresher, clustering is an unsupervised learning technique that groups data into k clusters (the number is usually predefined by us) without knowing in advance which cluster each data point belongs to. The clustering algorithm tries to learn the pattern by itself. We’ll be using one of the most widely used clustering algorithms: K-means. It assigns each tweet to the cluster whose centroid is nearest.
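
If you’re curious what happens under the hood, here is a bare-bones sketch of the K-means loop in plain Python, run on a few invented 2D points. It is only meant to show the “assign to nearest centroid, then move the centroid” idea: the kmeans function and its naive initialization (the first k points) are my own simplifications, whereas scikit-learn uses a smarter k-means++ initialization by default.

```python
import math

def kmeans(points, k, n_iter=10):
    # Naive initialization: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two obvious groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Our tweets work the same way, just with 100 coordinates per point instead of 2.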

# The number of clusters (3) was chosen using the Elbow Method; see the full code for details
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
data['cluster'] = kmeans.fit_predict(data['vec'].values.tolist())

We have now stored the cluster that each tweet belongs to. Let’s try to plot it, but how can we visualize 100 dimensions?

Principal Component Analysis (PCA) comes to the rescue. It’s a commonly used dimensionality reduction technique. Not every dimension contributes equally to the full picture: some explain a lot of the variation in the data, while others explain very little. The idea behind PCA is to extract the principal components, the directions along which the data varies the most. We can use the first few principal components to visualize our data, since they keep as much of its variation as a handful of dimensions can.
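
For intuition, here is a small sketch that finds only the first principal component of some invented 2D points, using power iteration on the covariance matrix. This is a teaching toy under my own assumptions (the first_principal_component helper, the starting vector, the fixed number of iterations), not how scikit-learn computes PCA internally.

```python
import math

def first_principal_component(rows, n_iter=100):
    n, d = len(rows), len(rows[0])
    # Center the data: subtract the mean of every column.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and re-normalize.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered point onto the component.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# Hypothetical toy data lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
component, scores = first_principal_component(points)
print(component)
print(scores)
```

Here the data only varies along the diagonal, so a single number per point (its score) is enough to describe it. That’s the same trick we use to squeeze 100 dimensions into 3.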

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
# Fit PCA once and reuse the result for all three axes
components = pca.fit_transform(data['vec'].values.tolist())
data['x'] = components[:, 0]
data['y'] = components[:, 1]
data['z'] = components[:, 2]

Great, now let’s visualize it in a 2D scatterplot.

Some data points are overlapping with each other. What if we project it in 3D?

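import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib

fig = plt.figure(1, figsize=(10,10))
# Create the 3D axes through the figure; in recent Matplotlib releases,
# calling Axes3D(fig) directly no longer attaches the axes to the figure
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(data['x'],data['y'],data['z'], c=data['cluster'], cmap='rainbow')
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.set_facecolor('white')
plt.title("Tweet Clustering using K Means", fontsize=14)
plt.show()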

Beautiful!

But wait,

Did some of you notice that we could’ve just built a 3-dimensional word embedding model directly, instead of building a 100-dimensional model and then using PCA on it? Let’s try that.

But how can we compare which word embedding model clusters similar tweets better? This is why we run the clustering on the manually labeled dataset: we can check whether real accident tweets are grouped together and kept apart from non-accident tweets.
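
One simple way to do that check is to compute, for each cluster, the share of each flag among its tweets. Here is a sketch in plain Python. The cluster assignments and flags below are invented just to show the calculation; they are not the results of this experiment.

```python
from collections import Counter, defaultdict

def label_share_per_cluster(clusters, labels):
    # Count how often each label appears inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    # Turn the counts into proportions per cluster.
    return {
        cluster: {label: n / sum(counter.values()) for label, n in counter.items()}
        for cluster, counter in counts.items()
    }

# Hypothetical toy data: cluster id and accident flag (1 = accident) per tweet.
clusters = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
labels   = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

shares = label_share_per_cluster(clusters, labels)
for cluster in sorted(shares):
    print(cluster, shares[cluster])
```

A cluster whose tweets almost all share the same flag is a “pure” cluster, which is what we’re hoping to see.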

Experiment Result

Let’s see the label proportion of each cluster. The first figure used 3D word embedding while the second one used the default 100 dimensions. There’s a big difference in how cluster 0 is created.

Using 3D word embedding, cluster 0 correctly grouped 87% of the accident tweets, and cluster 2 correctly grouped 97% of the non-accident tweets.

Using 100D word embedding, cluster 0 correctly grouped ~99% of the accident tweets, while cluster 2 correctly grouped 97.5% of the non-accident tweets.

With more dimensions, the word embedding model can capture more information and produce better clusters. The PCA visualization is just for intuitive understanding; what matters more is the ability to accurately cluster texts with the same properties/characteristics!

Now, let’s see if each cluster has some kind of characteristic. We’ll see some samples from each cluster. I’ll describe each cluster’s characteristics for non-Indonesian people (or you can use a translator).

Cluster 0 — Accident — Red Points

Tweet characteristic: live reports from trusted users.

‘23.27: @PTJASAMARGA : Kunciran KM 14 — KM 16 arah Bitung PADAT, ada penanganan kecelakaan kendaraan truk fuso di bahu jalan.’,
‘20.35 WIB #Tol_Japek Karawang Timur KM 51 — KM 52 arah Cikampek PADAT, ada Evakuasi Kecelakaan Kendaraan Truk di lajur 1/kiri dan bahu jalan.’,
‘♻️ @SenkomCMNP: 5:09 Wib. Kendaraan Truk Tangki Pertamina yang mengalami Kecelakaan di KM 16+600. Masih Penanganan Petugas. Lajur 1 dan 2 Sudah bisa di lewati.(uda) @SonoraFM92 @RadioElshinta https://t.co/9QqgdoBzQW',

Cluster 1 — Green Cluster — Random

Tweet characteristic: random tweets by random users, usually telling their personal story.

‘Plot twist: Ibunya abis kecelakaan, nemenin ke UGD dan baru bisa ditinggal.\n\nTapi ya ga masalah sih. Yg penting kan eTikA pRofEsiOnaL. https://t.co/0e2zHaMkCo',
‘Dapet video kejadian kecelakaan tunggal di Margonda, Depok tadi pagi.. Ya Allah.. sedih liatnya.. \n\nSudah biasa liat yg ky gt.. Ga serem, tp sedih iya.. Turut berduka.. :(\n\nBuat kalian yg bawa kendaraan, jgn lupa berdoa sebelum bepergian, hati2 dan patuhi rambu yaa..’,
‘👦: “Abis kecelakaan dimna lo?”\n\n👧: “ Gue gk kecelakaan kok, aman2 aja”\n\n👦: “ trus itu knapa muka lo ancur”\n\nSABAR. Muka jelek emang banyak cobaan’,

Cluster 2 — Purple Cluster — Not accident

Tweet characteristic: containing news and information (not about real-time accident, but events of the past)

‘Rekaman CCTV Kecelakaan Motor di PIK, depan Taman Grisenda :\nhttps://t.co/gMHLep9IvZ mhmmdrhmtrmdhn\nVisit Wonderful #MRahmatRamadhan’,
‘Tewaskan 346 Orang dalam 2 Kecelakaan, Boss Boeing Minta Maaf https://t.co/wLRhFy8oYE',
‘Anggota parlemen Taiwan juga berencana meningkatkan denda maksimum dan masa hukuman bagi orang yang menyetir dalam keadaan mabuk. https://t.co/GSWqziaKDN',

So, what’s next?

Remember that we didn’t feed the flag data to the model, yet about half of the data was grouped almost perfectly (~99%) into real accident and non-accident bins. Knowing this, we can use clustering methods to label the unlabelled data. The original purpose of the labeled dataset is to solve a classification problem (supervised learning), but as we can see, the clustering technique can be used to enrich it even more.
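
As a rough sketch of how that labelling could work: give each “pure” cluster the flag that dominates it, then assign every new tweet vector to its nearest centroid and let it inherit that flag. Everything below (the centroids, the cluster-to-flag mapping, the new vectors) is hypothetical and only illustrates the idea.

```python
import math

def nearest_centroid(vec, centroids):
    # Index of the centroid closest to vec (Euclidean distance).
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

# Hypothetical centroids from a fitted clustering, in 2D for readability.
centroids = [[0.0, 0.0], [5.0, 5.0]]
# Hypothetical mapping: cluster 0 is mostly accident (1), cluster 1 mostly not (0).
cluster_to_flag = {0: 1, 1: 0}

# Hypothetical vectors of tweets that have no manual label yet.
unlabelled = [[0.3, -0.2], [4.6, 5.3]]
pseudo_flags = [cluster_to_flag[nearest_centroid(v, centroids)] for v in unlabelled]
print(pseudo_flags)
```

In practice you would probably only trust this for clusters that are very pure, and leave the mixed ones for manual labelling.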

“We don’t have better algorithms. We just have more data.” —
Peter Norvig

Sometimes, more data (quality data, of course) is much more useful than an improved algorithm. With more and better data, even a simple algorithm can give great results. We saw this in our experiment too, when we compared the 3D and 100D word embedding models, right?

Here’s an analogy from me:

A good driver may be skilled enough to win the race, but you can only get better results with a better car.
Your driving skill is the algorithm, and the car is the data.

I hope this gives whoever is reading a better sense of how we process text data and use it for text clustering. Feel free to comment or point out any mistakes you find.

Thank you! Stay safe and healthy, folks.