Brief: K-means clustering is an unsupervised learning method. In this post, I introduce the idea of unsupervised learning and why it is useful. Then I cover K-means clustering: the mathematical formulation of the problem, a Python implementation from scratch, and an implementation using a machine learning library.

Unsupervised Learning

Typically, machine learning models make predictions on data, learning previously unseen patterns that support important business decisions. When the data set includes labels along with the data points, the task is known as supervised learning; spam detection, speech recognition and handwriting recognition are some of its use cases. Learning methods that draw insights from data points without any ground truth or correct labels fall under the category of unsupervised learning.

Unsupervised learning is one of the basic techniques used in exploratory data analysis to make sense of the data before building more complex machine learning models for inference. Because it does not rely on human-labelled data, the bias that labelling can introduce is avoided.

Also, as there are no labels, there are no correct answers to check the result against. From a probabilistic standpoint, the contrast between supervised and unsupervised learning is the following: supervised learning infers a distribution that involves the label y (p(x|y) in the density-estimation view, or p(y|x) when the label is predicted directly from the data point), whereas unsupervised learning is concerned with the prior probability p(x) of the data alone.

K-Means Clustering Algorithm

The objective of clustering methods is to partition the data points into a pre-determined number of clusters, maximizing inter-cluster distance and minimizing intra-cluster distance (that is, increasing similarity within each cluster).

K-Means is one of the clustering techniques in unsupervised learning. Some other commonly used techniques are fuzzy clustering (soft k-means), hierarchical clustering and mixture models. Hard clustering, or hard k-means, assigns each data point to exactly one cluster (e.g. an email is either Spam or Not Spam), whereas soft k-means assigns each data point a non-zero membership value for every cluster (e.g. Spam: 13%, Not Spam: 87%). I am covering hard clustering in this post.
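To make the hard versus soft distinction concrete, here is a minimal sketch. The function names, the example distances and the `beta` stiffness parameter are assumptions for illustration; the soft rule shown is one common choice (a softmax over negative squared distances), not the only way soft memberships are defined.

```python
import numpy as np

def hard_assign(sq_dists):
    """Hard clustering: each point gets exactly one label (nearest centroid)."""
    return np.argmin(sq_dists, axis=1)

def soft_assign(sq_dists, beta=1.0):
    """Soft clustering (illustrative): memberships sum to 1 per point.

    Uses a softmax over -beta * squared distance; beta is a hypothetical
    stiffness parameter, larger values approach hard assignment.
    """
    logits = -beta * sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Squared distances of 2 points (rows) to 2 centroids (columns), made up here.
sq_dists = np.array([[0.5, 2.5],
                     [4.0, 1.0]])

print(hard_assign(sq_dists))   # one integer label per point
print(soft_assign(sq_dists))   # one membership row per point
```

The hard labels are simply the positions of the largest soft memberships, which is why hard k-means can be seen as the limiting case of this soft rule.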

How the K-means algorithm works:

1. Pick k centroids randomly (without replacement) from X.

2. Compute the distance (L2, i.e. Euclidean) of each x from all μ’s.

3. Assign each x the label of its closest centroid.

4. Update each centroid to the arithmetic mean of the points assigned to its cluster.

5. Repeat steps 2–4 until the centroids stop changing.
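The steps above can be traced by hand on a tiny example. The sketch below uses four made-up 1-D points and deliberately poor starting centroids (fixed rather than random, so the run is reproducible); these values are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Four 1-D points forming two obvious groups (illustrative data).
X = np.array([[1.0], [2.0], [9.0], [10.0]])
# Step 1, fixed here instead of random: start from the first two points.
centroids = np.array([[1.0], [2.0]])

for step in range(10):
    # Step 2: squared L2 distance of every point to every centroid, shape (4, 2).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Step 3: label each point with its closest centroid.
    labels = sq_dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points.
    new_centroids = np.array(
        [X[labels == k].mean(axis=0) for k in range(len(centroids))]
    )
    # Step 5: stop once the centroids no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)      # [0 0 1 1]
print(centroids)   # [[1.5] [9.5]]
```

The centroids move from (1, 2) to (1, 7) and then to (1.5, 9.5), after which the assignments no longer change and the loop stops.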

Mathematically, it can be reduced to finding an optimal partition S* of the dataset X.

Mathematical formulation of K-means
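For reference, the standard k-means objective (the within-cluster sum of squares) can be written as below. The notation is an assumption chosen to match the text: S = {S_1, ..., S_k} is a partition of X into k clusters and μ_i is the mean of the points in S_i. The snippet assumes the `amsmath` package.

```latex
S^{*} = \operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

Steps 2 and 3 of the algorithm lower this sum for fixed centroids, and step 4 lowers it for fixed assignments, so the objective never increases from one iteration to the next.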

Code

First, I will write a basic implementation of k-means from scratch in Python.

import numpy as np
import matplotlib.pyplot as plt

class kmeans:
    """Apply kmeans algorithm"""
    def __init__(self, num_clusters, max_iter=1000):
        """Initialize number of clusters"""
        
        self.num_clusters = num_clusters
        self.max_iter = max_iter
    
    def initalize_centroids(self, X):
        """Choosing k centroids randomly from data X"""
        
        idx = np.random.permutation(X.shape[0])
        centroids = X[idx[:self.num_clusters]]
        return centroids
        
    def compute_centroid(self, X, labels):
        """Modify centroids by finding mean of all k partitions"""
        
        centroids = np.zeros((self.num_clusters, X.shape[1]))
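        for k in range(self.num_clusters):
            members = X[labels == k]
            if len(members) == 0:
                # Empty cluster: re-seed from a random data point to avoid a NaN mean
                centroids[k] = X[np.random.randint(X.shape[0])]
            else:
                centroids[k] = np.mean(members, axis=0)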
            
        return centroids
    
    def compute_distance(self, X, centroids):
        """Computing L2 norm between datapoints and centroids"""

        distances = np.zeros((X.shape[0], self.num_clusters))
        
        for k in range(self.num_clusters):
            dist = np.linalg.norm(X - centroids[k], axis=1)
            distances[:,k] = np.square(dist)
            
        return distances
    
    def find_closest_cluster(self, distance):
        return np.argmin(distance, axis=1)
    
    def fit(self, X):
        self.centroids = self.initalize_centroids(X)
        
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroid(X, self.labels)
            
            if np.all(old_centroids == self.centroids):
                break
        
    def compute_sumstar(self, distances):
        """Computing sum total of all distances"""
        pass
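The `compute_sumstar` method is left as a stub in the class. One plausible reading, assuming it is meant to return the value of the k-means objective, is sketched below as standalone functions. The function names and the example data are hypothetical; the distance computation uses NumPy broadcasting in place of the per-cluster loop.

```python
import numpy as np

def squared_distances(X, centroids):
    """Squared L2 distance from every point to every centroid, shape (n, k)."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
    diff = X[:, None, :] - centroids[None, :, :]
    return np.sum(diff ** 2, axis=2)

def within_cluster_sum_of_squares(X, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return float(np.sum(np.min(squared_distances(X, centroids), axis=1)))

# Four points, each exactly 1 unit away from its nearest centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(squared_distances(X, centroids))
print(within_cluster_sum_of_squares(X, centroids))  # 4.0
```

A lower value means tighter clusters, so this number is a convenient way to compare two runs that started from different random centroids.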

Let’s generate some data and apply k-means to see how it works.

# creating an artificial test case

np.random.seed(1234)  # seed the global generator so the data is reproducible
data = -2 * np.random.rand(1000, 2) 
data[500:] = 1 + 2 * np.random.rand(500,2)

# plotting the data 

plt.figure(figsize=(6,6))
plt.scatter(data[:,0], data[:, 1])
plt.show()

Synthesized data

# Applying k-means on the data

kmeansmodel = kmeans(num_clusters=2, max_iter=100)
kmeansmodel.fit(data)
centroids = kmeansmodel.centroids

centroids[0]

# plotting the clustered data with the centroids

plt.figure(figsize=(6,6))
plt.scatter(data[kmeansmodel.labels == 0, 0], data[kmeansmodel.labels == 0, 1], c = 'green', label = 'Cluster 1')
plt.scatter(data[kmeansmodel.labels == 1, 0], data[kmeansmodel.labels == 1, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', c = 'red', s = 300, label = 'centroid')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()

Output: K-means from scratch

Not bad, huh? Building a model from scratch in 50 lines of code is cool :)

The same task can be done within a few lines by importing the scikit-learn library.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# creating an artificial test case

np.random.seed(1234)  # seed the global generator so the data is reproducible
data = -2 * np.random.rand(1000, 2) 
data[500:] = 1 + 2 * np.random.rand(500,2)

# plotting the data 

plt.figure(figsize=(6,6))
plt.scatter(data[:,0], data[:, 1])
plt.show()

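# n_init is set explicitly because its default differs across scikit-learn versions
kmeansmodel = KMeans(n_clusters=2, n_init=10, random_state=1234)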
kmeansmodel.fit(data)

# plotting the clustered data with the centroids

plt.figure(figsize=(6,6))
plt.scatter(data[kmeansmodel.labels_ == 0, 0], data[kmeansmodel.labels_ == 0, 1], c = 'green', label = 'Cluster 1')
plt.scatter(data[kmeansmodel.labels_ == 1, 0], data[kmeansmodel.labels_ == 1, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(kmeansmodel.cluster_centers_[:, 0], kmeansmodel.cluster_centers_[:, 1], marker='*', c = 'red', s = 300, label = 'centroid')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()

Output: K-means using sklearn

Sklearn gives pretty much the same output as the model we built from scratch on this dummy data set.
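When comparing the two results numerically, keep in mind that cluster indices are arbitrary: what one run calls cluster 0 the other may call cluster 1. A minimal sketch of an order-independent comparison follows. The helper name, the tolerance and the centroid values are hypothetical, and sorting rows is only a reasonable approach when the centroids are well separated, as they are on this data set.

```python
import numpy as np

def centroids_match(a, b, tol=1e-2):
    """True if two centroid sets agree up to row order (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Sort rows lexicographically (first column is the primary key)
    # so that the cluster index order does not matter.
    a_sorted = a[np.lexsort(a.T[::-1])]
    b_sorted = b[np.lexsort(b.T[::-1])]
    return bool(np.allclose(a_sorted, b_sorted, atol=tol))

# Made-up centroids standing in for the two models' results.
scratch = np.array([[-1.0, -1.0], [2.0, 2.0]])
library = np.array([[2.001, 1.999], [-1.0, -1.001]])  # same clusters, swapped order

print(centroids_match(scratch, library))  # True
```

With the models above, the call would be `centroids_match(centroids, kmeansmodel.cluster_centers_)`, where `centroids` comes from the from-scratch model.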

Once you have written a bare-bones structure from scratch and are familiar with the nitty-gritty of the implementation, implementing k-means or any other algorithm with specialized library functions is a walk in the park.

Conclusion

K-means is one of the simplest unsupervised learning methods. It can be used to draw insights for EDA before moving on to build a sophisticated architecture to make decisions. This blog is a good starting point to get some idea about unsupervised learning, clustering, k-means and its implementation.

Feel free to read, code and explore to learn more. Drop a note down below to share your experience. Thanks for reading :)

This article was originally published by Ribhu Nirek on Medium.