Using K-Modes to Cluster Categorical Data

If you've been following the series of posts I've written so far, you'll already be familiar with the k-means clustering algorithm and how to determine the optimal number of clusters for it to produce. That clustering method, unfortunately, is only useful for numerical datasets. What happens if we want to cluster non-numerical data, such as names, locations, colors, and companies? The algorithm I'll cover here is a variation of k-means that can be used on categorical data. This method is called K-Modes.

So, what is the K-Modes Algorithm?

The K-Modes clustering procedure is actually very similar to k-means.

Let's say we have a small dataset of categorical data about individuals that we'd like to cluster into 3 groups:

|     | Education Level | Ethnicity        | Occupation | Marital Status | State          |
|-----|-----------------|------------------|------------|----------------|----------------|
| P1  | Associates      | Hispanic         | Analyst    | Single         | Florida        |
| P2  | Bachelors       | Caucasian        | Engineer   | Divorced       | California     |
| P3  | Masters         | Native American  | Engineer   | Single         | North Carolina |
| P4  | Bachelors       | Hispanic         | Engineer   | Divorced       | California     |
| P5  | Associates      | African American | Accountant | Married        | Texas          |
| P6  | Associates      | Caucasian        | Engineer   | Married        | Texas          |
| P7  | Bachelors       | Caucasian        | Accountant | Widowed        | Florida        |
| P8  | Associates      | Asian            | Teacher    | Married        | Texas          |
| P9  | Bachelors       | African American | Accountant | Single         | Florida        |
| P10 | Masters         | Caucasian        | Lawyer     | Single         | Texas          |

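If you'd like to follow along in code, here's the example dataset as a plain Python dictionary (the variable and field names here are my own choices, not part of any library):

```python
# Each person maps to a tuple of categorical attributes:
# (education level, ethnicity, occupation, marital status, state).
data = {
    "P1":  ("Associates", "Hispanic",         "Analyst",    "Single",   "Florida"),
    "P2":  ("Bachelors",  "Caucasian",        "Engineer",   "Divorced", "California"),
    "P3":  ("Masters",    "Native American",  "Engineer",   "Single",   "North Carolina"),
    "P4":  ("Bachelors",  "Hispanic",         "Engineer",   "Divorced", "California"),
    "P5":  ("Associates", "African American", "Accountant", "Married",  "Texas"),
    "P6":  ("Associates", "Caucasian",        "Engineer",   "Married",  "Texas"),
    "P7":  ("Bachelors",  "Caucasian",        "Accountant", "Widowed",  "Florida"),
    "P8":  ("Associates", "Asian",            "Teacher",    "Married",  "Texas"),
    "P9":  ("Bachelors",  "African American", "Accountant", "Single",   "Florida"),
    "P10": ("Masters",    "Caucasian",        "Lawyer",     "Single",   "Texas"),
}
```
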
The process of clustering the dataset can be broken down into five steps.

Step 1: Select k data points at random to be initial center points

Just like with k-means, the method starts by deciding on the number of clusters desired, k, and then choosing k data points at random to serve as the initial centroids.

From our little example dataset, we will choose P2, P6, and P9 as center points (centroids):

|    | Education Level | Ethnicity        | Occupation | Marital Status | State      |
|----|-----------------|------------------|------------|----------------|------------|
| P2 | Bachelors       | Caucasian        | Engineer   | Divorced       | California |
| P6 | Associates      | Caucasian        | Engineer   | Married        | Texas      |
| P9 | Bachelors       | African American | Accountant | Single         | Florida    |
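
In code, the random pick is a one-liner. A minimal sketch (the example above happens to use P2, P6, and P9, but an unseeded run will pick a different trio):

```python
import random

points = [f"P{i}" for i in range(1, 11)]  # P1 .. P10

# Pick k = 3 distinct points at random to serve as the initial centroids.
centroids = random.sample(points, k=3)
```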

Step 2: Calculate dissimilarities between each datapoint and the center points

K-means used Euclidean distance to determine how similar (or dissimilar) a datapoint was to a centroid. K-modes uses a different method for determining similarity.

K-modes compares each datapoint to each centroid and counts the number of attributes on which they differ. That count is the dissimilarity score: the lower it is, the more similar the two records are.

When we take P1 for example, and compare it to each centroid, here's what we get:

P1 => P2

|    | Education Level | Ethnicity | Occupation | Marital Status | State      |
|----|-----------------|-----------|------------|----------------|------------|
| P1 | Associates      | Hispanic  | Analyst    | Single         | Florida    |
| P2 | Bachelors       | Caucasian | Engineer   | Divorced       | California |

Number of Dissimilarities: 5

P1 => P6

|    | Education Level | Ethnicity | Occupation | Marital Status | State   |
|----|-----------------|-----------|------------|----------------|---------|
| P1 | Associates      | Hispanic  | Analyst    | Single         | Florida |
| P6 | Associates      | Caucasian | Engineer   | Married        | Texas   |

Number of Dissimilarities: 4

P1 => P9

|    | Education Level | Ethnicity        | Occupation | Marital Status | State   |
|----|-----------------|------------------|------------|----------------|---------|
| P1 | Associates      | Hispanic         | Analyst    | Single         | Florida |
| P9 | Bachelors       | African American | Accountant | Single         | Florida |

Number of Dissimilarities: 3
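
This attribute-by-attribute comparison (sometimes called simple matching dissimilarity) is easy to express in code. A sketch that reproduces the three counts above:

```python
def dissimilarity(a, b):
    """Count the attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))

p1 = ("Associates", "Hispanic",         "Analyst",    "Single",   "Florida")
p2 = ("Bachelors",  "Caucasian",        "Engineer",   "Divorced", "California")
p6 = ("Associates", "Caucasian",        "Engineer",   "Married",  "Texas")
p9 = ("Bachelors",  "African American", "Accountant", "Single",   "Florida")
```

For P1, the scores come out to 5, 4, and 3 against P2, P6, and P9 respectively, matching the comparisons above.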

Step 3: Assign data points to a cluster

As with K-means, each datapoint is assigned to the centroid with the smallest dissimilarity score.

Below is a table of the dissimilarities between each data point and the centroids, along with the resulting cluster assignment. (P3 is equally dissimilar to all three centroids; such ties are broken arbitrarily, here in favor of Cluster 1.)

|     | Cluster 1 (P2) | Cluster 2 (P6) | Cluster 3 (P9) | Assigned Cluster |
|-----|----------------|----------------|----------------|------------------|
| P1  | 5              | 4              | 3              | Cluster 3        |
| P2  | 0              | 3              | 4              | Cluster 1        |
| P3  | 4              | 4              | 4              | Cluster 1        |
| P4  | 1              | 4              | 4              | Cluster 1        |
| P5  | 5              | 2              | 3              | Cluster 2        |
| P6  | 3              | 0              | 5              | Cluster 2        |
| P7  | 3              | 4              | 2              | Cluster 3        |
| P8  | 5              | 2              | 5              | Cluster 2        |
| P9  | 4              | 5              | 0              | Cluster 3        |
| P10 | 4              | 3              | 4              | Cluster 2        |
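
Assignment is just an argmin over the dissimilarity scores. A sketch (ties, like P3's, fall to the first centroid here, whereas a random choice would also be valid):

```python
def dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

def assign(point, centroids):
    # Index of the centroid with the smallest dissimilarity;
    # min() keeps the first centroid on a tie.
    return min(range(len(centroids)), key=lambda i: dissimilarity(point, centroids[i]))

p1 = ("Associates", "Hispanic",         "Analyst",    "Single",   "Florida")
p3 = ("Masters",    "Native American",  "Engineer",   "Single",   "North Carolina")
p2 = ("Bachelors",  "Caucasian",        "Engineer",   "Divorced", "California")
p6 = ("Associates", "Caucasian",        "Engineer",   "Married",  "Texas")
p9 = ("Bachelors",  "African American", "Accountant", "Single",   "Florida")

cents = [p2, p6, p9]  # the three initial centroids
```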

Step 4: Calculate new cluster centroids

This step is where k-modes gets its name. We compute each cluster's new centroid by taking the mode of each attribute across the cluster's members. If two or more values tie for the most occurrences, one of them can be chosen at random.

The new centroids for each cluster are:

Cluster 1

|              | Education Level | Ethnicity       | Occupation | Marital Status | State          |
|--------------|-----------------|-----------------|------------|----------------|----------------|
| P2           | Bachelors       | Caucasian       | Engineer   | Divorced       | California     |
| P3           | Masters         | Native American | Engineer   | Single         | North Carolina |
| P4           | Bachelors       | Hispanic        | Engineer   | Divorced       | California     |
| New Centroid | Bachelors       | Caucasian       | Engineer   | Divorced       | California     |

Cluster 2

|              | Education Level | Ethnicity        | Occupation | Marital Status | State |
|--------------|-----------------|------------------|------------|----------------|-------|
| P5           | Associates      | African American | Accountant | Married        | Texas |
| P6           | Associates      | Caucasian        | Engineer   | Married        | Texas |
| P8           | Associates      | Asian            | Teacher    | Married        | Texas |
| P10          | Masters         | Caucasian        | Lawyer     | Single         | Texas |
| New Centroid | Associates      | Caucasian        | Accountant | Married        | Texas |

Cluster 3

|              | Education Level | Ethnicity        | Occupation | Marital Status | State   |
|--------------|-----------------|------------------|------------|----------------|---------|
| P1           | Associates      | Hispanic         | Analyst    | Single         | Florida |
| P7           | Bachelors       | Caucasian        | Accountant | Widowed        | Florida |
| P9           | Bachelors       | African American | Accountant | Single         | Florida |
| New Centroid | Bachelors       | Caucasian        | Accountant | Single         | Florida |
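
Computing a new centroid is a per-column mode. A sketch using collections.Counter (on a tie, most_common keeps the first value it saw, standing in for the random choice — Cluster 3's ethnicity column, for instance, is a three-way tie, which the table above resolves as Caucasian):

```python
from collections import Counter

def mode_centroid(members):
    # zip(*members) turns rows into columns; take each column's most common value.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

cluster3 = [
    ("Associates", "Hispanic",         "Analyst",    "Single",  "Florida"),  # P1
    ("Bachelors",  "Caucasian",        "Accountant", "Widowed", "Florida"),  # P7
    ("Bachelors",  "African American", "Accountant", "Single",  "Florida"),  # P9
]

new_centroid = mode_centroid(cluster3)
```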

Step 5: Repeat steps 2, 3 and 4

As with k-means, the algorithm reassigns data points to clusters based on their dissimilarity to the new centroids, computes new centroids again, and repeats until the cluster assignments no longer change.
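
Putting the steps together, the whole loop can be sketched in a few dozen lines of plain Python. This is an illustrative toy, not a production implementation — libraries such as the third-party kmodes package offer a tuned version:

```python
import random
from collections import Counter

def dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

def k_modes(points, k, seed=None):
    """Toy k-modes: random initial centroids, then alternate assignment
    and per-column mode updates until the assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(100):  # safety cap on iterations
        new_labels = [min(range(k), key=lambda i: dissimilarity(p, centroids[i]))
                      for p in points]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(Counter(col).most_common(1)[0][0]
                                     for col in zip(*members))
    return centroids, labels

points = [
    ("Associates", "Hispanic",         "Analyst",    "Single",   "Florida"),
    ("Bachelors",  "Caucasian",        "Engineer",   "Divorced", "California"),
    ("Masters",    "Native American",  "Engineer",   "Single",   "North Carolina"),
    ("Bachelors",  "Hispanic",         "Engineer",   "Divorced", "California"),
    ("Associates", "African American", "Accountant", "Married",  "Texas"),
    ("Associates", "Caucasian",        "Engineer",   "Married",  "Texas"),
    ("Bachelors",  "Caucasian",        "Accountant", "Widowed",  "Florida"),
    ("Associates", "Asian",            "Teacher",    "Married",  "Texas"),
    ("Bachelors",  "African American", "Accountant", "Single",   "Florida"),
    ("Masters",    "Caucasian",        "Lawyer",     "Single",   "Texas"),
]

centroids, labels = k_modes(points, k=3, seed=42)
```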

Determining the optimal 'k'

Choosing the optimal number of clusters for k-modes can be done with a slightly modified version of the elbow method used for k-means. Here are the steps to follow:

  1. Perform k-modes clustering for a range of values of k.
  2. After each clustering, compute the dissimilarity score between each datapoint and its assigned cluster centroid, and add them up to get the total dissimilarity score.
  3. Create a line plot of the total dissimilarity score for each value of k.
  4. Finally, look for the "elbow" in the plot: the value of k beyond which the total dissimilarity score decreases only marginally. This value is your optimal number of clusters for the dataset.
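
The quantity plotted in step 3 — the total dissimilarity score — is straightforward to compute once a run has finished. A sketch, evaluated here against the initial centroids P2, P6, and P9 from the worked example:

```python
def dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_cost(points, centroids):
    # Each point contributes its dissimilarity to the nearest centroid.
    return sum(min(dissimilarity(p, c) for c in centroids) for p in points)

points = [
    ("Associates", "Hispanic",         "Analyst",    "Single",   "Florida"),
    ("Bachelors",  "Caucasian",        "Engineer",   "Divorced", "California"),
    ("Masters",    "Native American",  "Engineer",   "Single",   "North Carolina"),
    ("Bachelors",  "Hispanic",         "Engineer",   "Divorced", "California"),
    ("Associates", "African American", "Accountant", "Married",  "Texas"),
    ("Associates", "Caucasian",        "Engineer",   "Married",  "Texas"),
    ("Bachelors",  "Caucasian",        "Accountant", "Widowed",  "Florida"),
    ("Associates", "Asian",            "Teacher",    "Married",  "Texas"),
    ("Bachelors",  "African American", "Accountant", "Single",   "Florida"),
    ("Masters",    "Caucasian",        "Lawyer",     "Single",   "Texas"),
]
p2, p6, p9 = points[1], points[5], points[8]

cost = total_cost(points, [p2, p6, p9])
```

Run this for each candidate k (after the algorithm has converged), collect the costs, and plot them against k to find the elbow.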

Now that you know the k-modes clustering algorithm, what datasets do you plan on using it to cluster? Share your plans in the comments below.


Until next time! :)