Using K-Modes to Cluster Categorical Data
If you've been following the series of posts I've written thus far, you will have already become familiar with the k-means clustering algorithm and how to determine the optimal number of clusters to produce using the algorithm. This clustering method, unfortunately, is only useful for numerical datasets. What happens if we want to cluster non-numerical data, such as names, locations, colors, and companies? The clustering algorithm I will cover is a variation of k-means that can be used on categorial data. This method is called K-Modes.
So, what is the K-Modes Algorithm?
The K-Modes clustering procedure is very similar to k-means, actually.
Lets say we have a small dataset of individual categorical data that we'd like to cluster into 3 groups:
Education Level | Ethnicity | Occupation | Marital Status | State | |
P1 | Associates | Hispanic | Analyst | Single | Florida |
P2 | Bachelors | Caucasian | Engineer | Divorced | California |
P3 | Masters | Native American | Engineer | Single | North Carolina |
P4 | Bachelors | Hispanic | Engineer | Divorced | California |
P5 | Associates | African American | Accountant | Married | Texas |
P6 | Associates | Caucasian | Engineer | Married | Texas |
P7 | Bachelors | Caucasian | Accountant | Widowed | Florida |
P8 | Associates | Asian | Teacher | Married | Texas |
P9 | Bachelors | African American | Accountant | Single | Florida |
P10 | Masters | Caucasian | Lawyer | Single | Texas |
The process to cluster the dataset can be broken down to 4 steps.
Step 1: Select k data points at random to be initial center points
Just like with K-means, the method starts out with deciding the number of clusters desired and then choosing at random several points to be the initial center points for the algorithm.
From our little example dataset, we will choose P2, P6, and P9 as center points (centroids):
Education Level | Ethnicity | Occupation | Marital Status | State | |
P2 | Bachelors | Caucasian | Engineer | Divorced | California |
P6 | Associates | Caucasian | Engineer | Married | Texas |
P9 | Bachelors | African American | Accountant | Single | Florida |
Step 2: Calculate dissimilarities between each datapoint and the center points
K-means used Euclidean distance to determine how similar (or dissimilar) a datapoint was to a centroid. K-modes uses a different method for determining similarity.
K-modes compares each datapoint to the centroids and count up the number of dissimilar attributes.
When we take P1 for example, and compare it to each centroid, here's what we get:
P1 => P2
Education Level | Ethnicity | Occupation | Marital Status | State | |
P1 | Associates | Hispanic | Analyst | Single | Florida |
P2 | Bachelors | Caucasian | Engineer | Divorced | California |
Number of Dissimilarities: 5
P1 => P6
Education Level | Ethnicity | Occupation | Marital Status | State | |
P1 | Associates | Hispanic | Analyst | Single | Florida |
P6 | Associates | Caucasian | Engineer | Married | Texas |
Number of Dissimilarities: 4
P1 => P9
Education Level | Ethnicity | Occupation | Marital Status | State | |
P1 | Associates | Hispanic | Analyst | Single | Florida |
P9 | Bachelors | African American | Accountant | Single | Florida |
Number of Dissimilarities: 3
Step 3: Assign data points to a cluster
As with K-means, each datapoint is assigned to the centroid with the smallest dissimilarity score.
Below is a table containing the dissimilarities between each data point and the centroids and its assigned cluster.
Cluster 1 (P2) | Cluster 2 (P6) | Cluster 3 (P9) | Assigned Cluster | |
P1 | 5 | 4 | 3 | Cluster 3 |
P2 | 0 | 3 | 5 | Cluster 1 |
P3 | 4 | 5 | 4 | Cluster 1 |
P4 | 1 | 4 | 4 | Cluster 1 |
P5 | 5 | 2 | 3 | Cluster 2 |
P6 | 4 | 0 | 5 | Cluster 2 |
P7 | 3 | 4 | 1 | Cluster 3 |
P8 | 5 | 2 | 5 | Cluster 2 |
P9 | 3 | 5 | 0 | Cluster 3 |
P10 | 4 | 3 | 4 | Cluster 2 |
Step 4: Calculate new cluster centroids
This step is where k-modes gets its name from. We compute the new centroids for each cluster by using the mode. If there are values that have the same number of occurrence, one of the values can be chosen at random.
The new centroids for each cluster are:
Cluster 1
Education Level | Ethnicity | Occupation | Marital Status | State | |
P2 | Bachelors | Caucasian | Engineer | Divorced | California |
P3 | Masters | Native American | Engineer | Single | North Carolina |
P4 | Bachelors | Hispanic | Engineer | Divorced | California |
New Centroid | Bachelors | Caucasian | Engineer | Divorced | California |
Cluster 2
Education Level | Ethnicity | Occupation | Marital Status | State | |
P5 | Associates | African American | Accountant | Married | Texas |
P6 | Associates | Caucasian | Engineer | Married | Texas |
P8 | Associates | Asian | Teacher | Married | Texas |
P10 | Masters | Caucasian | Lawyer | Single | Texas |
New Centroid | Associates | Caucasian | Accountant | Married | Texas |
Cluster 3
Education Level | Ethnicity | Occupation | Marital Status | State | |
P1 | Associates | Hispanic | Analyst | Single | Florida |
P7 | Bachelors | Caucasian | Accountant | Widowed | Florida |
P9 | Bachelors | African American | Accountant | Single | Florida |
New Centroid | Bachelors | Caucasian | Accountant | Single | Florida |
Step 5: Repeat steps 2, 3 and 4
As with k-means, the algorithm reassigns data points to clusters based on their similarity to the centroids. New centroids are then computed, and the process repeats until no changes in the cluster assignments are made.
Determining the optimal 'k'
Choosing the optimal number of clusters for k-modes can be done using a slightly modified version of the elbow method employed for k-means. Here's the steps to follow:
- Perform K modes clustering for a range of values of k.
- After each clustering, compute the dissimilarity score between each datapoint and it's assigned cluster centroid, and add them up to get the total dissimilarity score.
- Create a line plot of the total dissimilarity score for each value of k.
- Finally look at the plot for the value of k where the dissimilarity score begins to decrease linearly. This value is your optimal number of clusters for the dataset.
Now that you know the k-modes clustering algorithm, what datasets do you plan of using it to cluster? Share your plans in the comments below.
Until next time! :)