What is Clustering Algorithm?

In a business context: Clustering algorithm is a technique that assists customer segmentation which is the process of classifying similar customers into the same segment. Clustering algorithm helps to better understand customers, in terms of both static demographics and dynamic behaviors.

Customer with comparable characteristics often interact with the business similarly, thus business can benefit from this technique by creating tailored marketing strategy for each segment.

In a data science context: Clustering algorithm is an unsupervised machine learning algorithm that discovers groups of data points that are closely related. The fundamental difference between supervised and unsupervised algorithm is that:

supervised algorithm: it requires partitioning the dataset into train and test set, and the algorithm learned based on the output/label of the train dataset and generalize it to unobserved data. For instance, decision tree, regression, neural networks.
unsupervised algorithm: it is used to discover hidden patterns when there isn't any defined output/label from the dataset. For instance, clustering, association rule mining, dimension reduction.

Prepare Data for Clustering

After giving an overview of what is clustering, let's delve deeper into an actual Customer Data example. I am using the Kaggle dataset "Mall Customer Segmentation Data", and there are five fields in the dataset, ID, age, gender, income and spending score.

What the mall is most concerned about are customers' spending scores, hence the objective of the clustering algorithm is to find hidden clusters in respects of the field spending score.

1. Load and Preview Data

Load the dataset and summarize column statistics using describe().

Examine the distribution of each field: use bar chart for categorical variables and histogram for numeric variables.

If you would like more details about data visualization and EDA, please check out these two articles.

"Feature Selection and EDA in Machine Learning"

"Semi-Automated Exploratory Data Analysis (EDA) in Python"

2. Data Preprocessing

Now that we define the objectives and have a better understanding of our data, we need to preprocess the data to meet the model requirement. In this example, I am choosing K Means clustering as the main algorithm.

The standard k means algorithm is not directly applicable to categorical data. To be more specific, the search space of categorical variables (e.g. Gender in this example) is discrete (either male or female), hence cannot be directly combined with a continuous space and measured the distance in the same manner. Therefore I discarded the "Gender" field as well as "CustomerID" field.

Since k means interpret the closeness between data points based on Euclidean distance, it is important to reconcile all dimensions into the standard scale. An appropriate type of data transformation should be selected to align with the distribution of the data. This article from Google provides a general guidance of data transformation in clustering.

In Summary:

if normal distribution → normalization
if power law distribution → log transformation
if data doesn't conform to any distribution → quantiles

From the earlier univariate analysis, we can see that those attributes doesn't conform to either normal distribution or law distribution. Therefore, I use MinMaxScaler to shrink the data range to between 0 and 1, while maintaining the distribution shape.

3. EDA

Exploratory Data Analysis provides visual clues about whether it is likely to form insightful clusters when combining multiple variables together. It is also an imperative step because choosing an appropriate clustering algorithm is reliant on the shape of the cluster.

Some center-based algorithms (e.g. K Means) are more adaptable towards globular shapes clusters and they tend to break linear shaped clusters apart. While density-based algorithms (e.g. DBSCAN) are better at clusters with irregular shape and a lot of noise.

I have visualized those three fields in the following way.

2D Scatter Plot

age vs. annual Income
annual income vs. spending score
age vs. spending income

3D Scatter Plot:

age vs. annual income vs. spending score

In this case, it is not hard to observe that the plot of "annual income vs. spending score" generates some center-based clusters. Therefore, I will use K means algorithm to make further exploration.

K Means Clustering

1. How does K Means Clustering work?

K Means Clustering is a centre-based clustering algorithm, which means that it assigns data points to clusters based on closeness or distance, following these procedures:

Specify the number of clusters “K”.
Initiate k random centroids and assign each cluster a centroid: Centroid is the center of each cluster. There are k data points randomly selected as the centroids at the beginning and the cluster label of other data points are later defined relatively to them. Therefore, different initial centroid assignments may lead to different cluster formations.
Form K clusters by assigning data points to the closest centroid: The closest centroid is usually defined by the smallest Euclidean distance but it can also be correlation or cosine depends on the use cases.
Recompute the centroid of each cluster After all data points have been assigned to a cluster, for each cluster we recalculate the mean of all data points belonging to that cluster and define it as the new centroid.
Reach convergence when centroids no longer change Iterate step 2-4 until reaching the stoping criteria that either the centroids no longer change or the maximum number of iterations are reached

This procedure also determines that k means algorithm has the limitation of clustering points into circular shapes with similar size and it is also heavily reliant on the predefined number k.

2. How to build a K Means Clustering model?

Now let's take a closer look at how to implement it using python scikit learn.

Firstly, I define a KMeans_Algorithm function that pass the dataset and the number of clusters as the parameters.

This function calls the KMeans() from sklearn library and specify some key parameters:

n_clusters: the number of clusters
init: controls initialization techniques and I chose "k-means++" to select initial centroids in a smart way to speed up convergence
max_iter: specify the maximum number of iterations allowed for each run
algorithm: there are several options to choose from, "auto", "full" and "elkan" and I chose "elkan" because it is a type of K means algorithm variation that uses triangle inequality to make it more time efficient.
random_state: use an integer to determines the random generation of initial centroids

The output will generates following attributes and evaluation metrics:

cluster_centres_: returns an array of the centroid locations
inertia_: sum of squared distances of data points to their closest centroid
labels_: the cluster label assigned to each data point
silhouette_score: the distance between the data point to other data points in the same cluster compares to the data points in the nearest neighbor cluster

In this example, I am mainly interested in the formation of clusters with regards to "Spending Score". Therefore, I used dataset X1, X2, X3 as specified below to feed the k means clustering model.

Notice that, one of the distinct difference between unsupervised learning and supervised learning is that unsupervised doesn't require partitioning the dataset for training and testing.

Age vs. Spending Score

The code below used the dataset X1 - scaled Age vs. scaled Spending Score and visualize how the clustering changes as the number of clusters increased from 2 to 10.

The first part of the code iterates through the number of clusters and generate a scatter plot for each, with the red square representing the centroids. As shown below, how the clustering changes as the specified "k" value changes.

Afterwards, I evaluate the model performance based on two metrics: inertia and silhouette coefficient.

Since clustering is an unsupervised algorithm, we cannot directly assess the model quality based on discrepancies between the predicted results and the actual results. Therefore, the evaluation should be based on the principle of minimizing intra-cluster distance while maximizing the inter-cluster distance. There are two evaluation methods determines the optimal number of clusters based on this idea.

1) Elbow method using inertia:

Inertia measures the sum of squared distances of samples to their closest cluster centroid. With the same number of cluster, smaller the inertia indicates better clusters. The elbow method determines the optimal number of clusters by looking at the turning point ("elbow" point) of the graph below, in this case is 6.

2) Silhouette coefficient:

Here is the Silhouette coefficient formula that makes it appear daunting:

But basically it translates into minimizing intra-cluster distance while maximizing the inter-cluster distance. a(i) is the average distance of data point i to other data points in the same cluster, and b(i) is the average distance of data point i to all points in the nearest cluster. The aim is to minimize a(i) and maximize b(i), therefore when s(i) is closer to 1 indicates better clustering. In this example, both 2 and 6 are showing a peak in the score.

Based on the results from both metrics, 6 is most likely to be the optimal number of clusters when plotting "Age" against "Spending Score". However, either from the score or the visualizations, it is still not convincing enough to say that age and spending score together form insightful and distinct clusters.

Annual Income vs. Spending Score

From the EDA process, we observed that there are several distinct clusters from the chart. Now, let's take a closer look at whether k means can differentiate customers into distinct segments as expected. The code is similar to the one above but just changing the input dataset into X2 - scaled Annual Income vs. scaled Spending Score.

It is quite obvious that there are decent segmentation when the number of clusters equals 5 and it is clearly shown in the scatter plot as well.

Age vs. Annual Income vs. Spending Score

If you are interested, we can also take a further investigation of how these three fields interacts with each other.

Although, in terms of both the visualization and the metric values, there is no indication of forming any outstanding clusters, unsupervised learning is about discovering and exploring hidden effort. Nevertheless, it may be still worth the effort investigating further.

3. How does it apply to customer segmentation?

After all, the objective of clustering model is to bring insights to customer segmentation. In this exercise, grouping customers into 5 segments, in terms of two dimensions: Spending Score vs. Annual Income, is most beneficial to create tailored marketing strategies.

As shown below, customers can be segmented into five groups:

high annual income (above 70 k$) and high spending score (above 60)
high annual income (above 70 k$) and low spending score (below 40)
low annual income (below 40 k$) and high spending score (above 60)
low annual income (below 40 k$) and low spending score (below 40)
average annual income (between 40 and 70 k$) and average spending score (between 40 and 60)

Based on the distinct features of each group, the business can approach each customer groups differently. The first and third types of customers generate the most revenue and they require retention using marketing strategies, such as loyalty program or offer discounts through newsletter.

On the other hand, customer acquisition strategies are more suitable for group 2 and 4. This can be creating tailored marketing campaigns based on their income levels, since high income group and low income group will have different preferences in products.

Take a Step Further

It is worth noting several limitations using k means clustering. It doesn't work well when the clusters vary in sizes and density, or if there are any significant outliers. Therefore, we need to consider other clustering algorithms when encountering situations like this.

For example, DBSCAN is more resistance to noise and outliers, and hierarchical clustering does not assume any particular number of clusters. If you are interested in implementing DBSCAN, please check out my code on the website.

Hope you enjoy my article, thanks :)

Take Home Message

This article introduces clustering algorithm, specifically K means clustering, and how we can apply it in a business context to assist customer segmentation. It covers some key points:

prepare data for clustering analysis: data transformation and exploratory data analysis
k means clustering algorithm implementation using scikit learn
clustering algorithm evaluation using inertia and silhouette score