
Fixing The Biggest Problem of K Means Clustering

Initialize centroids smartly rather than picking them randomly at the start



Mertcan Coskun


If you are new to this field, I'll start by saying that k-means clustering is a very useful technique in unsupervised machine learning, and one you should definitely master.

K-means Clustering

Clustering can help us explore the dataset and separate cases into groups representing similar traits or characteristics. Each group could be a potential candidate for a class.

The main problem most people have with the k-means algorithm seems to be picking the number of clusters, K. But there is a less talked about issue: initialization. How do we initialize the centroids?

A major weakness in k-means is in the first step where it randomly picks k of the data points to be the initial centroids.

The randomness of k-means

The initialization of the standard k-means algorithm is totally random. Because it is random, sometimes the initial centroids are a great pick and lead to near-optimal clustering.

Other times the initial centroids are a reasonable pick and lead to good clustering. But sometimes, again because they are picked randomly, the initial centroids are poor and lead to non-optimal clustering.

It is not hard to guess that this creates problems: especially on bigger and more complex datasets, the algorithm can run for too long and may never reach an optimal clustering.

That’s where k-means++ comes in. The k-means++ algorithm fixes this defect by changing the way we pick the initial centroids. Everything else about k-means stays the same.

How does k-means++ solve the issue?

The steps of k-means++ are the following:

  1. Create an empty list for centroids.
  2. Select the first centroid uniformly at random, as before.
  3. Until K initial centroids have been selected, repeat:
  • Compute the distance between each point and its closest centroid.
  • Select one point at random, with probability proportional to that distance, and add it to the set of centroids.

Here, the most important part is “with probability proportional to distance”. What does that mean exactly? Suppose the algorithm is running and we have two centroids already.

Now there are 5 data points left as candidates. The table below shows the distance from each of those data points to each centroid. For every point, the distance to its closest centroid is kept and turned into a weight, and the new centroid is then chosen at random according to those weights.

Let’s visualize. DC: distance to cluster, DP: data point.

Figure: Weight assigning process

The new centroid will be picked from the weights table, with bigger weights having a higher chance of being selected, like bigger slices of a pie.

Figure: Pie of probabilities
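To make the weighting concrete, here is a tiny sketch with made-up distances (the figure's exact numbers are not reproduced here):

import random

# Hypothetical closest-centroid distances for the 5 remaining points.
distances = [2.0, 1.0, 4.0, 0.5, 2.5]

# Normalize the distances into weights: each weight is one slice of the pie.
total = sum(distances)
weights = [d / total for d in distances]
print(weights)  # [0.2, 0.1, 0.4, 0.05, 0.25]

# Points with bigger slices are more likely to become the new centroid.
new_centroid = random.choices(range(len(distances)), weights=weights)[0]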

The code

The standard k-means algorithm uses the following code to initialize the centroids (the random model):

# Pick k point indices at random; each centroid is built from the
# feature rows, which start at index 1 in this data layout.
self.centroids = [[self.data[i][r] for i in range(1, len(self.data))]
                  for r in random.sample(range(len(self.data[0])), self.k)]

Code: Standard K means centroid initialization

For the k-means++ algorithm, we replace it with the following code:

def selectInitialCentroids(self):
    centroids = []
    # Pick the first centroid uniformly at random from the point indices.
    current = random.choice(range(len(self.data[0])))
    centroids.append(current)
    # Pick the remaining k - 1 centroids, weighting each point by its
    # distance to the closest centroid chosen so far.
    for _ in range(self.k - 1):
        weights = [self.distanceToClosestCentroid(x, centroids)
                   for x in range(len(self.data[0]))]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Draw one point index according to the weights by walking
        # the cumulative distribution until it passes a random number.
        num = random.random()
        total = 0
        x = -1
        while total < num:
            x += 1
            total += weights[x]
        centroids.append(x)
    # Turn the chosen indices into centroid coordinates
    # (feature rows start at index 1, as in the standard version above).
    self.centroids = [[self.data[i][r] for i in range(1, len(self.data))]
                      for r in centroids]

Code: k-means++ centroid initialization
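The snippet above calls self.distanceToClosestCentroid, which the article does not show. A minimal sketch, assuming Euclidean distance and the same data layout as the rest of the code (self.data[i][x] holds feature i of point x, with feature rows starting at index 1), could look like this:

import math

def distanceToClosestCentroid(self, x, centroids):
    # Distance from point x to the nearest already-chosen centroid.
    best = float("inf")
    for c in centroids:
        d = math.sqrt(sum((self.data[i][x] - self.data[i][c]) ** 2
                          for i in range(1, len(self.data))))
        best = min(best, d)
    return best

Note that the original k-means++ paper weights points by the squared distance, so skipping the square root here would match the canonical formulation.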

Implementing the algorithm on real data

Let's take a look at our example dataset.

Figure: Our Dataset

Our data consists of 15 food items, which fall into three groups: fast food items with high sodium and calories, drinks with high calories but low sodium, and veggies with low calories and sodium. We can visualize this and see it for ourselves:

Figure: Dataset Visualized

We can see that there are 3 clusters. Let’s see how the two algorithms did.

Figure: Clustering Results based on standard and ++ method

The k-means++ algorithm correctly clustered every single item, while the standard k-means algorithm mixed some fast food items in with the drinks.
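If you want to reproduce a comparison like this without the article's own implementation, scikit-learn's KMeans exposes both initializations through its init parameter. A minimal sketch, with made-up calorie and sodium values standing in for the food dataset:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (calories, sodium) rows mimicking the three food groups.
X = np.array([
    [550, 1200], [600, 1100], [500, 1000], [650, 1300], [580, 1150],  # fast food
    [150, 20], [200, 30], [180, 15], [220, 40], [160, 25],            # drinks
    [50, 10], [40, 5], [60, 15], [30, 8], [45, 12],                   # veggies
])

for init in ("random", "k-means++"):
    # n_init=1 so a single initialization decides the outcome.
    km = KMeans(n_clusters=3, init=init, n_init=1, random_state=0).fit(X)
    print(init, km.inertia_)

A lower inertia (the sum of squared distances from each point to its assigned centroid) indicates a tighter clustering.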

Conclusion

Our example showed that standard k-means can result in imperfect clustering.

The difference between k-means and k-means++ on real-world datasets is quite substantial. Deeper tests show that k-means++ consistently outperforms k-means, both by achieving a lower potential value and by completing faster.

The example used here is very simple: we know that burgers go into fast food and broccoli is a vegetable. But don't let that fool you. Unsupervised clustering algorithms have incredible power in real-life problems, from genetic engineering to customer segmentation.

Even in a small example, k-means++ showed its superiority. It's an improved version of the standard k-means algorithm, and I hope you use it in your next task.
