# K-means clustering is an unsupervised learning method.

Ribhu Nirek

2 years ago | 5 min read

Brief: K-means clustering is an unsupervised learning method. In this post, I introduce the idea of unsupervised learning and why it is useful. Then I cover K-means clustering: the mathematical formulation of the problem, a Python implementation from scratch, and an implementation using a machine learning library.

# Unsupervised Learning

Typically, machine learning models make predictions on data, learning previously unseen patterns to support important business decisions. When the data set consists of labels along with data points, the task is known as supervised learning; spam detection, speech recognition, and handwriting recognition are some of its use cases. Learning methods that draw insights from data points without any ground truth or correct labels fall under the category of unsupervised learning.

Unsupervised learning is one of the basic techniques used in exploratory data analysis to make sense of the data before building more complex machine learning models for inference. Because it does not rely on human-labelled data, it avoids the bias that labelling can introduce.

Also, as there are no labels, there are no correct answers. From a probabilistic standpoint, the contrast between supervised and unsupervised learning is the following: supervised learning infers a conditional probability distribution that involves the label y, most commonly p(y|x), whereas unsupervised learning is concerned with the distribution of the data itself, p(x).

# K-Means Clustering Algorithm

The objective of clustering methods is to partition data points into a pre-determined number of clusters, maximizing inter-cluster distance and minimizing intra-cluster distance (that is, increasing similarity within each cluster).

K-means is one of the clustering techniques among unsupervised learning algorithms. Some other commonly used techniques are fuzzy clustering (soft k-means), hierarchical clustering, and mixture models. Hard clustering, or hard k-means, assigns each data point to exactly one cluster (e.g. an email is either Spam or Not Spam), whereas soft k-means assigns each point a membership value for every cluster (Spam: 13%, Not Spam: 87%). I am covering hard clustering in this post.
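To make the hard versus soft distinction concrete, here is a minimal sketch for a single data point. The distances are made-up illustrative values, and the soft memberships use the fuzzy c-means weighting with fuzzifier m = 2 (an assumption; other soft schemes exist).

```python
import numpy as np

# Hypothetical distances from one data point to two centroids (illustrative values).
distances = np.array([1.0, 3.0])

# Hard assignment: the point belongs only to the nearest centroid.
hard_label = int(np.argmin(distances))

# Soft assignment: fuzzy c-means style memberships with fuzzifier m = 2,
# i.e. weights proportional to 1 / distance**2, normalized to sum to 1.
inverse_sq = 1.0 / distances ** 2
memberships = inverse_sq / inverse_sq.sum()

print(hard_label)    # index of the closest centroid
print(memberships)   # one membership value per centroid
```

The hard label keeps only the index of the nearest centroid, while the membership vector keeps how strongly the point leans toward each one.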

How the K-means algorithm works:

1. Pick k centroids randomly (without replacement) from X.

2. Compute the distance (L2, or Euclidean, distance) of each x from all μ’s.

3. Assign each x the label of its closest centroid.

4. Update the centroids by taking the arithmetic mean of each of the k clusters.

5. Repeat steps 2–4 until the centroids stop changing.
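The steps above can be traced on a tiny example. This is only a sketch: the four points are invented for illustration, and the initial centroids are fixed rather than random so that the run is reproducible.

```python
import numpy as np

# Tiny illustrative dataset: two points near the origin, two near (5, 5).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Step 1 (fixed here instead of random, so the result is reproducible).
centroids = X[[0, 2]].copy()

for _ in range(10):
    # Step 2: Euclidean distance from every point to every centroid, shape (n, k).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Step 3: label each point with the index of its closest centroid.
    labels = np.argmin(distances, axis=1)
    # Step 4: move each centroid to the mean of the points assigned to it.
    new_centroids = np.array(
        [X[labels == j].mean(axis=0) for j in range(len(centroids))]
    )
    # Step 5: stop once the centroids no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
print(centroids)
```

On this data the first two points end up in one cluster and the last two in the other, with each centroid sitting at the midpoint of its pair.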

Mathematically, it can be reduced to finding an optimal partition S* of the dataset X.

Mathematical formulation of K-means
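A common way to write this objective is shown below. The notation is assumed here for illustration: S = {S_1, ..., S_k} is a partition of X into k clusters and μ_i is the mean of the points in S_i.

$$
S^* = \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
$$

Steps 2 and 3 of the algorithm reduce this sum for fixed centroids, and step 4 reduces it for fixed labels, which is why the iteration settles down.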

# Code

First, I will write a basic implementation of k-means from scratch in Python.

```python
import numpy as np
import matplotlib.pyplot as plt


class kmeans:
    """Apply kmeans algorithm"""

    def __init__(self, num_clusters, max_iter=1000):
        """Initialize number of clusters"""
        self.num_clusters = num_clusters
        self.max_iter = max_iter

    def initialize_centroids(self, X):
        """Choosing k centroids randomly from data X"""
        idx = np.random.permutation(X.shape[0])
        centroids = X[idx[:self.num_clusters]]
        return centroids

    def compute_centroid(self, X, labels):
        """Modify centroids by finding mean of all k partitions"""
        centroids = np.zeros((self.num_clusters, X.shape[1]))
        for k in range(self.num_clusters):
            centroids[k] = np.mean(X[labels == k], axis=0)
        return centroids

    def compute_distance(self, X, centroids):
        """Computing squared L2 norm between datapoints and centroids"""
        distances = np.zeros((X.shape[0], self.num_clusters))
        for k in range(self.num_clusters):
            dist = np.linalg.norm(X - centroids[k], axis=1)
            distances[:, k] = np.square(dist)
        return distances

    def find_closest_cluster(self, distance):
        """Index of the nearest centroid for every data point"""
        return np.argmin(distance, axis=1)

    def fit(self, X):
        self.centroids = self.initialize_centroids(X)
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroid(X, self.labels)
            # stop as soon as an update leaves the centroids unchanged
            if np.all(old_centroids == self.centroids):
                break

    def compute_sumstar(self, distances):
        """Computing sum total of all distances"""
        pass
```

Let’s generate some data and apply k-means to see how it works.

```python
# creating an artificial test case
np.random.seed(1234)  # seed the global generator so the data is reproducible
data = -2 * np.random.rand(1000, 2)
data[500:] = 1 + 2 * np.random.rand(500, 2)

# plotting the data
plt.figure(figsize=(6, 6))
plt.scatter(data[:, 0], data[:, 1])
plt.show()
```

Synthesized data

```python
# Applying k-means on the data
kmeansmodel = kmeans(num_clusters=2, max_iter=100)
kmeansmodel.fit(data)
centroids = kmeansmodel.centroids

# plotting the clustered data with the centroids
plt.figure(figsize=(6, 6))
plt.scatter(data[kmeansmodel.labels == 0, 0], data[kmeansmodel.labels == 0, 1],
            c='green', label='Cluster 1')
plt.scatter(data[kmeansmodel.labels == 1, 0], data[kmeansmodel.labels == 1, 1],
            c='blue', label='Cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='*', c='red', s=300, label='centroid')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()
```

Output: K-means from scratch

Not bad, huh? Building a model from scratch in 50 lines of code is cool :)

The same task can be done within a few lines by importing the scikit-learn library.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# creating an artificial test case
np.random.seed(1234)  # seed the global generator so the data is reproducible
data = -2 * np.random.rand(1000, 2)
data[500:] = 1 + 2 * np.random.rand(500, 2)

# plotting the data
plt.figure(figsize=(6, 6))
plt.scatter(data[:, 0], data[:, 1])
plt.show()

kmeansmodel = KMeans(n_clusters=2)
kmeansmodel.fit(data)

# plotting the clustered data with the centroids
plt.figure(figsize=(6, 6))
plt.scatter(data[kmeansmodel.labels_ == 0, 0], data[kmeansmodel.labels_ == 0, 1],
            c='green', label='Cluster 1')
plt.scatter(data[kmeansmodel.labels_ == 1, 0], data[kmeansmodel.labels_ == 1, 1],
            c='blue', label='Cluster 2')
plt.scatter(kmeansmodel.cluster_centers_[:, 0], kmeansmodel.cluster_centers_[:, 1],
            marker='*', c='red', s=300, label='centroid')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()
```

Output: K-means using sklearn

Sklearn gives pretty much the same output as the model we built from scratch on this dummy data set.
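One caveat when comparing two clusterings: the cluster indices themselves are arbitrary, so one model's cluster 0 may be the other's cluster 1. A rough way to check that two labelings describe the same grouping is sketched below. The helper `same_partition` and the label arrays are hypothetical and only for illustration.

```python
import numpy as np


def same_partition(labels_a, labels_b):
    """Return True if two labelings group the points identically,
    ignoring which integer each cluster happens to be called."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # Two points share a cluster in A exactly when they share one in B.
    together_a = labels_a[:, None] == labels_a[None, :]
    together_b = labels_b[:, None] == labels_b[None, :]
    return bool(np.array_equal(together_a, together_b))


# Hypothetical label arrays standing in for the two models' outputs.
scratch_labels = np.array([0, 0, 1, 1, 0])
sklearn_labels = np.array([1, 1, 0, 0, 1])

print(same_partition(scratch_labels, sklearn_labels))  # True: same grouping, swapped names
```

The pairwise comparison builds an n-by-n boolean matrix, which is fine for a thousand points but would need a different approach on large data sets.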

Once you have written a bare-bones implementation from scratch and are familiar with the nitty-gritty of how it works, implementing k-means or any other algorithm with specialized library functions is a walk in the park.

# Conclusion

K-means is one of the simplest unsupervised learning methods. It can be used to draw insights for EDA before moving on to build a sophisticated architecture to make decisions. This blog is a good starting point to get some idea about unsupervised learning, clustering, k-means and its implementation.


Created by

Ribhu Nirek

Data Science, Blogger, University of Maryland '21, IITK'19
