A Practical Introduction to Clustering Using a Customer Segmentation Problem (K-Means)
An application of clustering to the customer segmentation problem, and an introduction to unsupervised machine learning using K-Means.
What to anticipate:
- What customer segmentation is about.
- The idea of unsupervised learning.
- The ideas behind clustering.
- How to use K-Means to develop a clustering model to divide a company's customers.
- How to perform exploratory data analysis on datasets.
- How to build a clustering model to divide the customers into groups.
- An idea on how to present your work to management professionally.
In this article, I'd like to give an introduction to unsupervised machine learning using K-Means. We will work through it by solving a customer segmentation problem for a shopping mall. Because the results have to be communicated to the managers of the shopping mall, I will present them in a narrative, storytelling format. Without further ado, let's get started.
Knowing the needs and understanding the desires of the customer are crucial components of market sales. Given the rise in business activity around the world, it is critical for companies to implement optimized marketing strategies to help them stay competitive and generate more revenue.
What is Customer Segmentation?
Customer segmentation is the process of dividing customers into groups based on shared traits so that management or the business can efficiently focus on each group in accordance with its needs. In this article, we group mall customers using clustering, a machine learning approach. Several machine learning libraries will be used when developing the model.
What is K-Means?
K-Means is an unsupervised learning algorithm that uses vector quantization to divide a large number of observations into clusters, with each observation belonging to the cluster with the closest mean. It is used to solve clustering problems.
It can be used to group unlabeled data, that is, information without clearly defined categories or groups. The algorithm finds groups in the data, with the variable K indicating how many groups to find. Based on the supplied features, it iteratively assigns each data point to one of the K groups. K-Means is used in this problem because the customer data carries no predefined segment labels.
Before the model is built, the data is first formatted appropriately. This entails pre-processing the data and then creating the customer segmentation with the fitted K-Means model. K-Means produces clusters that segment customers based on specific selection criteria representing a customer's behavior or characteristics.
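The iterative assignment described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the Sklearn implementation used later in the article: the function names `kmeans_sketch` and `nearest_centroid`, the toy points, and the fixed number of iterations are all hypothetical choices made for the example.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for every point, the index of the closest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy K-Means: alternate centroid updates and point assignments."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random (an assumption;
    # Sklearn defaults to the k-means++ initialisation instead).
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = nearest_centroid(points, centroids)
    for _ in range(n_iter):
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # Assignment step: each point joins the cluster with the closest mean.
        labels = nearest_centroid(points, centroids)
    return labels, centroids

# Hypothetical toy data: two visibly separate groups of three points.
toy_points = np.array([[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]])
labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves carry no meaning; only the grouping does, which is why the segments are given descriptive names later on.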
Development environment and tools highlight:
- Sklearn, a well-known open-source Python data science library, will be used to put the clustering algorithm into practice.
- The Anaconda distribution comes with preinstalled machine learning libraries, including Sklearn and its k-means clustering implementation.
- The development environment can be a Python Jupyter notebook running in Google Chrome or, for something a little more sophisticated, Google Colab online.
- Installing the Anaconda package provides the Jupyter notebook.
- Windows 10 Pro is advised, or any recent operating system.
- Anaconda is run on Python 3.9.
- The machine learning packages/libraries are imported into the Anaconda environment, in Chrome, during coding.
We will proceed through the various steps below to obtain the segmentation results for the customers:
Importing Dependent Libraries
First things first, we import all the libraries we think we'll need to get going. Even if we miss a few of them, it won't matter because an error message will alert us and we'll take the appropriate action.
# Loading libraries
# used later to silence library warnings
import warnings
# data handling and mathematics operations
import pandas as pd
import numpy as np
from pandas import plotting
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
import plotly.figure_factory as ff
init_notebook_mode(connected = True)
Following the installation of dependencies and libraries, such as Numpy, matplotlib, seaborn, plotly, and pandas, we can import the csv file containing our dataset into the project environment with the help of pandas. The dataset for the customers is read from the local directory: 'C:/Users/Inuwa Mobarak Abraham/Desktop/Mall_Customers.csv'. This path will be different on your computer.
# Loading Dataset
customer_data = pd.read_csv('C:/Users/Inuwa Mobarak A/Desktop/Mall_Customers.csv')
CustomerID, Gender, Age, Annual income (k$), and Spending Score are the five columns that make up the data, which is read and loaded as shown above.
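The plotting code later in the article refers to columns named `Annual_Income_(k$)` and `Spending_Score`, with underscores. One way such names could be obtained after loading is sketched below. The in-memory CSV text and its header spelling are hypothetical stand-ins for the real file, and the renaming step is an assumption, since the article does not show how the columns were renamed.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for Mall_Customers.csv (headers assumed).
csv_text = """CustomerID,Gender,Age,Annual Income (k$),Spending Score
1,Male,19,15,39
2,Female,21,15,81
"""

sample_data = pd.read_csv(io.StringIO(csv_text))
# Replace spaces with underscores so the names match the later plotting code.
sample_data.columns = [name.strip().replace(" ", "_") for name in sample_data.columns]
print(list(sample_data.columns))
```

With the real file, `pd.read_csv` would be given the file path instead of the `io.StringIO` object.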
Looking at the dataset features
The shape attribute of the DataFrame can be used to assess the dataset's size. It shows us the number of rows and columns in the dataset.
We can see that none of the five columns in our dataset have null values. Furthermore, all of our entries are of the integer datatype, with the exception of Gender, which is an object. Gender takes only two values, so it will not affect the clustering much. Since the Non-Null count is 200 for every column, we can conclude that there are no empty cells.
# checking null value
The result False confirms that there are no empty cells.
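The inspection steps described above (size, datatypes, and the null check) can be sketched as follows. The three rows are a hypothetical stand-in for the real dataset, and `isnull().any().any()` is one common way to get a single True/False answer; the article does not show which call it actually used.

```python
import pandas as pd

# Hypothetical stand-in rows; the real values come from Mall_Customers.csv.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 21, 20],
    "Annual_Income_(k$)": [15, 15, 16],
    "Spending_Score": [39, 81, 6],
})

# shape is an attribute, not a method: (number of rows, number of columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# info() lists every column with its non-null count and datatype.
sample.info()

# True if at least one cell in the whole table is missing.
has_missing = bool(sample.isnull().any().any())
print(has_missing)
```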
Here, we begin to use graphs, plots, and charts to represent our dataset in a more visual form. Visualization can reveal hidden trends in the dataset.
To illustrate the connection between the values of Gender, we use a plot called Andrews curves. In an Andrews curves plot, each curve represents a single observation (row) of the dataset, colored by a class attribute, here Gender. The plot preserves information about distances, variances, and means.
plt.rcParams['figure.figsize'] = (15, 10)
# seeing the relationship between the values of Gender
plotting.andrews_curves(customer_data.drop("CustomerID", axis=1), "Gender")
plt.title('Andrew Curves for Gender', fontsize = 20)
plt.show()
Visualizing the values in the dataset's columns allows a deeper understanding of the dataset. Here, the curves are colored by the two values of the Gender column, Male and Female. Curves that lie close together indicate that the corresponding data points are also close together.
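To make it clearer what a single curve is, here is a minimal sketch of the standard Andrews curve definition, in which a row (x1, x2, x3, ...) becomes the function x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... evaluated for t between -pi and pi. The helper name `andrews_curve` and the sample row are hypothetical; the plotting call above does this work for every row of the dataset.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at the angles in t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature contributes a constant term.
    curve = np.full_like(t, row[0] / np.sqrt(2))
    # The remaining features alternate between sine and cosine terms,
    # with the frequency increasing after every sine/cosine pair.
    for i, value in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2
        if i % 2 == 1:
            curve += value * np.sin(harmonic * t)
        else:
            curve += value * np.cos(harmonic * t)
    return curve

# Hypothetical observation: age, annual income (k$), spending score.
t_values = np.linspace(-np.pi, np.pi, 5)
print(andrews_curve([19, 15, 39], t_values))
```

Two rows with similar feature values produce curves that stay close together over the whole range of t, which is the property used to read the plot.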
Seeing the distribution range of Annual Income and Age
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (18, 8)
# note: sns.distplot is deprecated in recent seaborn releases; sns.histplot(..., kde = True) is its replacement
plt.subplot(1, 2, 1)
sns.set(style = 'whitegrid')
sns.distplot(customer_data['Annual_Income_(k$)'])
plt.title('Distribution of Annual Income', fontsize = 20)
plt.xlabel('Range of Annual Income')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
sns.set(style = 'whitegrid')
sns.distplot(customer_data['Age'], color = 'red')
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')
plt.show()
The distribution of annual income and the distribution of age are shown in the plots above. Annual income is measured in thousands of US dollars (k$). We can see that the lowest incomes are around 20k US dollars per year. The majority of customers earn between 50k and 75k US dollars. We can also infer that not many customers make more than 100k US dollars.
Visualizing Gender with Pie Chart
labels = ['Female', 'Male']
size = customer_data['Gender'].value_counts()
colors = ['lightgreen', 'orange']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()
We also create a pie chart to illustrate the gender proportions of the customers. Women have the higher share at 56%, compared with 44% for men.
Seeing the Distribution of Age
plt.rcParams['figure.figsize'] = (15, 8)
# pass the column as x explicitly; newer seaborn releases no longer accept it positionally
sns.countplot(x = customer_data['Age'], palette = 'hsv')
plt.title('Distribution of Age', fontsize = 20)
plt.show()
The graph above shows the distribution of customer ages in greater detail. Customers aged 27 to 39 are the most common. The mall has an equal number of customers aged 18, 24, 28, 54, 59, and 67. Customers who are 55, 56, 64, and 69 years old are less common. The largest single group is customers who are 32 years old.
Visualizing the Distribution of Annual Income
plt.rcParams['figure.figsize'] = (20, 8)
# pass the column as x explicitly; newer seaborn releases no longer accept it positionally
sns.countplot(x = customer_data['Annual_Income_(k$)'], palette = 'rainbow')
plt.title('Distribution of Annual Income', fontsize = 20)
plt.show()
The plot above shows more clearly how each income level is distributed. Customers appear with broadly comparable frequency across annual incomes ranging from 15k to 137k US dollars. Customers with annual incomes of 54k or 78k US dollars are the most common in the mall.
Visualizing the Distribution of Spending Score
plt.rcParams['figure.figsize'] = (20, 8)
# pass the column as x explicitly; newer seaborn releases no longer accept it positionally
sns.countplot(x = customer_data['Spending_Score'], palette = 'copper')
plt.title('Distribution of Spending Score', fontsize = 20)
plt.show()
We can infer from the distribution that the majority of the customers have spending scores between 35 and 59. Additionally, spending scores of 73 and 75 occur frequently.
Pairplot of the dataset
sns.pairplot(customer_data)
plt.title('Pairplot for the Data', fontsize = 20)
plt.show()
Heatmap of the Dataset
We look for correlations between different attributes in order to better understand the behavior of our dataset. Knowing which characteristics are closely related can help the mall plan its operations and provide better customer service. We'll look for connections between the data.
plt.rcParams['figure.figsize'] = (15, 8)
# correlate the numeric columns only; Gender is text and cannot be correlated directly
sns.heatmap(customer_data.select_dtypes(include = 'number').corr(), cmap = 'Wistia', annot = True)
plt.title('Heatmap for the Data', fontsize = 20)
plt.show()
The graph above shows the correlation between the dataset's various attributes. In the heat map, the most closely related features appear in orange (darker) and the least closely related in yellow (lighter). It shows that there is only a little correlation between these attributes.
Visualizing Gender and Spendscore
plt.rcParams['figure.figsize'] = (18, 7)
# pass x and y explicitly; newer seaborn releases no longer accept them positionally
sns.boxenplot(x = customer_data['Gender'], y = customer_data['Spending_Score'], palette = 'Blues')
plt.title('Gender vs Spending Score', fontsize = 20)
plt.show()
The majority of males have a Spending Score between 25 and 75, while the majority of females have a Spending Score between 35 and 75. This suggests that female mall visitors tend to spend more than males.
We will complete the primary clustering of the different attributes here.
Importing Kmeans Algorithm
from sklearn.cluster import KMeans
# Selecting the features (assumed here: annual income and spending score, the two axes plotted below)
x = customer_data[['Annual_Income_(k$)', 'Spending_Score']].values
# Carrying Out Clustering
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km.fit(x)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
The elbow method is used to choose the number of clusters for K-means. The WCSS (within-cluster sum of squares) is the sum of the squared distances between each point and its cluster centroid. When the WCSS is plotted against the number of clusters, its value declines as the number of clusters rises. The number of clusters we use for our dataset is read off at the elbow of the curve, the point after which adding more clusters reduces the WCSS only slightly.
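As a check on the definition above, the WCSS can be computed by hand and compared with the `inertia_` value that Sklearn reports. This is a minimal sketch on hypothetical synthetic data (three well-separated blobs generated with NumPy), not on the mall dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three blobs of 30 points each around assumed centres.
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in blob_centres])

wcss_values = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(blobs)
    wcss_values.append(model.inertia_)

# Manual WCSS for a 3-cluster model: squared distance of every point
# to the centroid of the cluster it was assigned to, summed up.
model_3 = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(blobs)
assigned_centroids = model_3.cluster_centers_[model_3.labels_]
manual_wcss = ((blobs - assigned_centroids) ** 2).sum()

print(manual_wcss, model_3.inertia_)
print(wcss_values)
```

On data like this, the printed list drops sharply up to three clusters and then flattens, which is the elbow shape the plot is meant to reveal.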
Building K-Means Model
km = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(x[y_means == 3, 0], x[y_means == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(x[y_means == 4, 0], x[y_means == 4, 1], s = 100, c = 'orange', label = 'careful')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue' , label = 'centroid')
plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
The main customer clusters are shown above. This clustering analysis gives us a clear understanding of the various customer segments in the mall. Using Annual Income and Spending Score, which we considered the best attributes for identifying customer segments in a mall, our model shows five distinct customer segments: Miser, General, Target, Spendthrift, and Careful.
The Target segment consists of customers with a high Annual_Income who also have a high Spending_Score. Since the goal of the mall is profit, Target is the cluster of customers most likely to spend more money in the mall.
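When presenting segments like these to management, a small summary table per cluster is often easier to read than a scatter plot alone. The sketch below attaches the cluster label to each customer and averages the two features per cluster. The nine customers are hypothetical, only three clusters are used to keep the toy data small, and the column names follow the ones used in the article's plotting code.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customers forming three obvious groups.
customers = pd.DataFrame({
    "Annual_Income_(k$)": [15, 16, 17, 60, 62, 64, 120, 125, 130],
    "Spending_Score": [80, 85, 78, 50, 48, 52, 15, 12, 18],
})

features = customers[["Annual_Income_(k$)", "Spending_Score"]].values
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
customers["Cluster"] = model.fit_predict(features)

# One row per cluster: how many customers, and their average profile.
summary = customers.groupby("Cluster").agg(
    customers=("Spending_Score", "size"),
    mean_income=("Annual_Income_(k$)", "mean"),
    mean_score=("Spending_Score", "mean"),
)
print(summary)
```

The cluster numbers are arbitrary, so descriptive names such as Target or Careful have to be assigned by looking at each cluster's average income and spending score.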
Carrying out clustering on the Age
# Selecting the features (assumed here: age and spending score, the two axes plotted below)
x = customer_data[['Age', 'Spending_Score']].values
kmeans = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
ymeans = kmeans.fit_predict(x)
plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Cluster of Ages', fontsize = 30)
plt.scatter(x[ymeans == 0, 0], x[ymeans == 0, 1], s = 100, c = 'pink', label = 'Usual Customers' )
plt.scatter(x[ymeans == 1, 0], x[ymeans == 1, 1], s = 100, c = 'orange', label = 'Priority Customers')
plt.scatter(x[ymeans == 2, 0], x[ymeans == 2, 1], s = 100, c = 'lightgreen', label = 'Target Customers(Young)')
plt.scatter(x[ymeans == 3, 0], x[ymeans == 3, 1], s = 100, c = 'red', label = 'Target Customers(Old)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 50, c = 'black')
plt.style.use('fivethirtyeight')
plt.xlabel('Age')
plt.ylabel('Spending Score (1–100)')
plt.legend()
plt.grid()
plt.show()
The clustering graph above, which plots the age of the customers against their spending scores, shows four (4) different categories: Usual Customers, Priority Customers, Target Customers (Young), and Target Customers (Old). With this outcome, management can design marketing strategies and policies to maximize customer spending scores at the mall.
Carrying out Clustering on the Annual Income
# Selecting the features (assumed here: annual income and spending score, the two axes plotted below)
x = customer_data[['Annual_Income_(k$)', 'Spending_Score']].values
kmeans = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
ymeans = kmeans.fit_predict(x)
plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Cluster of Annual Income', fontsize = 30)
plt.scatter(x[ymeans == 0, 0], x[ymeans == 0, 1], s = 100, c = 'pink', label = 'Usual Customers' )
plt.scatter(x[ymeans == 1, 0], x[ymeans == 1, 1], s = 100, c = 'orange', label = 'Priority Customers')
plt.scatter(x[ymeans == 2, 0], x[ymeans == 2, 1], s = 100, c = 'lightgreen', label = 'Target Customers(Young)')
plt.scatter(x[ymeans == 3, 0], x[ymeans == 3, 1], s = 100, c = 'red', label = 'Target Customers(Old)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 50, c = 'black')
plt.style.use('fivethirtyeight')
plt.xlabel('Annual_Income')
plt.ylabel('Spending Score (1–100)')
plt.legend()
plt.grid()
plt.show()
The whole goal is to improve profit, and the Priority Customers segment looks especially interesting in that respect. As a result, management can design marketing plans and policies to maximize mall customers' spending scores.
Using a clustering model, we were able to divide mall customers into different groups. Without such an analysis, the shopping center is unaware of its visitors' shopping preferences and habits. With this knowledge, management can respond appropriately to the different types of customers in order to maximize profit. Some of the key takeaways are stated below:
- Female customers make up 56% of the customers and males 44%, and females tend to have higher spending scores. The mall management can try to provide more female-specific goods or, conversely, encourage the males by offering more goods they would like.
- By Annual Income, the lowest customer incomes are around 20k US Dollars, and most customers earn around 50k-75k US Dollars.
- The most common Annual Incomes among the mall's customers are 54k US Dollars and 78k US Dollars.
- By age, the most regular customers are around 30-35 years old. Customers aged 32 are the most frequent visitors.
- Finally, the distribution shows that most of the Customers have a Spending Score in the range of 35-59. From this point, the mall management should try to target this
- Spending scores of 73 and 75 also occur with high frequency.
Image: Woman shop photo created by artursafronovvvv