Mixed Data Clustering using K-Prototypes
This blog has thus far covered a number of data clustering algorithms. A majority of the algorithms operate under the assumption that the dataset to be clustered is purely numerical. The k-Modes algorithm was the one method we've covered that works on purely categorical data. What if you want to cluster a dataset that contains both categorical and numeric data? That's exactly what the k-prototypes algorithm allows you to do.
What is a Prototype?
In K-Means, the centroids are found by calculating the mean of each cluster's data points. In K-Modes, they are derived from the mode. For datasets with a mix of categorical and numerical data, the centroids are found by calculating the mean of the numeric data and taking the mode of the categorical data. A prototype is simply a combination of the mean and the mode.
|           | City    | Eye Color | Age   | Score |
|-----------|---------|-----------|-------|-------|
|           | Miami   | Blue      | 32    | 93    |
|           | Miami   | Green     | 25    | 32    |
|           | Chicago | Black     | 27    | 72    |
|           | Miami   | Blue      | 21    | 63    |
| Prototype | Miami   | Blue      | 26.25 | 65    |
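To make this concrete, here's a minimal sketch of how the prototype above could be computed with pandas (the column names mirror the example table and are purely illustrative):

```python
import pandas as pd

# The four data points from the example table above
cluster = pd.DataFrame({
    "City":      ["Miami", "Miami", "Chicago", "Miami"],
    "Eye Color": ["Blue", "Green", "Black", "Blue"],
    "Age":       [32, 25, 27, 21],
    "Score":     [93, 32, 72, 63],
})

numeric_cols = ["Age", "Score"]
categorical_cols = ["City", "Eye Color"]

# Mean for the numeric attributes, mode for the categorical ones
prototype = {
    **cluster[numeric_cols].mean().to_dict(),              # {'Age': 26.25, 'Score': 65.0}
    **cluster[categorical_cols].mode().iloc[0].to_dict(),  # {'City': 'Miami', 'Eye Color': 'Blue'}
}
print(prototype)
```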
The function that is used to check how similar or different each data point is to a prototype is also a combination of the Euclidean distance from K-Means and the matching function from K-Modes:

$$d(p, q) = \sqrt{\sum_{n \in N} (p_n - q_n)^2} + \gamma \sum_{c \in C} \delta(p_c, q_c)$$

where $\delta(p_c, q_c)$ is 0 if $p_c = q_c$ and 1 otherwise.
Let's take this function apart and examine what each piece means:
The first part of the function above is the k-means Euclidean distance measurement, and the second part is the k-modes matching dissimilarity measurement.
The symbols p and q in the function above represent two individual data points.
N and C represent the sets of numeric and categorical variables, and n and c represent a single variable from each set. The symbols $p_n$, $q_n$, $p_c$, and $q_c$ represent the numeric and categorical values of the two points for those variables.
Something new you might have also noticed is the gamma term:

$$\gamma \sum_{c \in C} \delta(p_c, q_c)$$
This term controls the amount of weight categorical variables contribute to the output.
A large enough value for gamma results in more focus being placed on categorical variables. As a result, a point could be assigned to a cluster whose prototype is not the nearest to it in terms of numeric variables, but whose categorical values match that point's.
Here's an example of the dissimilarity function in action using two data points:
|   | City    | Eye Color | Age | Score |
|---|---------|-----------|-----|-------|
| p | Miami   | Black     | 32  | 93    |
| q | Chicago | Black     | 27  | 72    |
For this pair of points, N is the set of numeric variables {Age, Score} and C is the set of categorical variables {City, Eye Color}.
Calculating the Euclidean (first) part of the function, we get the following:

$$\sqrt{(32 - 27)^2 + (93 - 72)^2} = \sqrt{466} \approx 21.59$$
For the matching (second) part of the function, let's suppose that gamma is set to 1.5. Here's what we'd get:

$$1.5 \times \big(\delta(\text{Miami}, \text{Chicago}) + \delta(\text{Black}, \text{Black})\big) = 1.5 \times (1 + 0) = 1.5$$
Adding the results from both calculations together, the final result is approximately 23.09.
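Here's a minimal sketch of that dissimilarity calculation in Python. The function and variable names are my own; it simply mirrors the formula above:

```python
import numpy as np

def dissimilarity(p_num, q_num, p_cat, q_cat, gamma):
    """Euclidean distance on the numeric values plus gamma-weighted categorical mismatches."""
    euclidean = np.sqrt(np.sum((np.asarray(p_num, dtype=float) - np.asarray(q_num, dtype=float)) ** 2))
    mismatches = sum(a != b for a, b in zip(p_cat, q_cat))
    return euclidean + gamma * mismatches

# p = (Miami, Black, 32, 93) and q = (Chicago, Black, 27, 72) from the table above
d = dissimilarity([32, 93], [27, 72], ["Miami", "Black"], ["Chicago", "Black"], gamma=1.5)
print(round(d, 2))  # 23.09
```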
Here's the step by step breakdown of the k-prototypes procedure, taking into account everything covered so far:
- Choose k and gamma. Determine the number of clusters to form and the amount of weight categorical variables should have on the clustering.
- Select initial prototypes. In this step the algorithm chooses k data points from the dataset at random to be the initial prototypes.
- Assign data points to a prototype. In this step the algorithm calculates the dissimilarity between each data point in the dataset and each prototype and assigns the data point to the prototype it is closest to.
- Calculate new prototypes. A new prototype is calculated for each cluster by taking the mean of its numeric attributes and the mode of its categorical attributes, as described earlier.
- Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new prototypes. After the reassignment new prototypes are computed. This process is repeated until no changes in the assignments are made.
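Putting the steps together, here's a minimal sketch of the whole procedure. It assumes the dataset is already split into a numeric array X_num and a categorical array X_cat, and it reuses the dissimilarity() helper from the earlier sketch; a production implementation (for example, the KPrototypes class in the kmodes package) would be more efficient and robust.

```python
import numpy as np

def k_prototypes(X_num, X_cat, k, gamma, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_num)

    # Step 2: pick k random data points as the initial prototypes
    idx = rng.choice(n, size=k, replace=False)
    proto_num, proto_cat = X_num[idx].astype(float), X_cat[idx].copy()

    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 3: assign every point to the prototype it is least dissimilar to
        new_labels = np.array([
            np.argmin([dissimilarity(X_num[i], proto_num[j], X_cat[i], proto_cat[j], gamma)
                       for j in range(k)])
            for i in range(n)
        ])

        # Step 5: stop once no assignments change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 4: recompute each prototype (mean of numeric, mode of categorical attributes)
        for j in range(k):
            members = labels == j
            if members.any():
                proto_num[j] = X_num[members].mean(axis=0)
                for c in range(X_cat.shape[1]):
                    values, counts = np.unique(X_cat[members, c], return_counts=True)
                    proto_cat[j, c] = values[np.argmax(counts)]

    return labels, proto_num, proto_cat
```

Usage would look something like `labels, proto_num, proto_cat = k_prototypes(X_num, X_cat, k=3, gamma=1.5)`.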
Determining Optimal Number of Clusters
Like all the other methods covered before this one, the Elbow method can be used to determine the optimal number of clusters. Here's the new cost function you will be using for the elbow plot:

$$E = \sum_{j=1}^{k} \sum_{p \in S_j} d(p, \mu_j)$$

where $S_j$ is the set of points assigned to cluster $j$, $\mu_j$ is that cluster's prototype, and $d$ is the dissimilarity function defined earlier.
As you might have guessed, this is simply a combination of the sum of squared error from k-means and the total dissimilarity score from k-modes.
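As a rough sketch of what that looks like in practice, you could run the algorithm for a range of k values, sum each point's dissimilarity to its assigned prototype, and plot the curve. This reuses the hypothetical k_prototypes() and dissimilarity() helpers from the sketches above, plus matplotlib:

```python
import matplotlib.pyplot as plt

def total_cost(X_num, X_cat, labels, proto_num, proto_cat, gamma):
    # Sum of each point's dissimilarity to its own cluster's prototype
    return sum(
        dissimilarity(X_num[i], proto_num[labels[i]], X_cat[i], proto_cat[labels[i]], gamma)
        for i in range(len(X_num))
    )

ks = range(1, 9)
costs = []
for k in ks:
    labels, proto_num, proto_cat = k_prototypes(X_num, X_cat, k=k, gamma=1.5)
    costs.append(total_cost(X_num, X_cat, labels, proto_num, proto_cat, gamma=1.5))

plt.plot(list(ks), costs, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Total dissimilarity (cost)")
plt.title("Elbow plot for k-prototypes")
plt.show()
```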
But wait, what about optimal values for gamma?
With the addition of the gamma term, you might be wondering if there's a method for finding good values for that parameter. Finding optimal values involves pure experimentation. You perform k-prototypes on your dataset several times with different values and choose a value that weighs the categorical and numeric attributes in the way you expect.
If you're looking for a quick and dirty value that provides a decent balance between the categorical and numeric attributes in your dataset, a good heuristic you can use is the standard deviation of the numeric attributes in the dataset. Z. Huang, in his early experiments with the k-prototypes algorithm, found a suitable value to lie between 1/3 and 2/3 of the standard deviation of the numeric data. Here's a link to his paper where he describes the experiments in more detail.
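As a rough illustration of that heuristic (assuming the numeric attributes live in a numpy array called X_num, as in the earlier sketches):

```python
import numpy as np

sigma = np.asarray(X_num, dtype=float).std(axis=0).mean()  # average std dev of the numeric attributes
gamma = 0.5 * sigma                                        # a starting point in Huang's 1/3 - 2/3 range
```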
That's all, folks!
I hope you enjoyed this tutorial. One thing that was new with this segmentation method was the use of a gamma parameter to control the influence categorical variables have on the output. In my next post I will cover another clustering algorithm that also makes use of weights. Stay tuned.