Using K-Medians for Outlier-Robust Clustering
Hey, readers! I'm back again with another blog post. If you've been following the series so far, you'll already be familiar with the k-means clustering algorithm. In my last post I covered k-modes, a variation of k-means that can be used for clustering categorical data. In this post I'll introduce you to another variation of k-means called k-medians.
The case for K-Medians
In the K-Means post, I mentioned that one of the limitations of the algorithm is its sensitivity to outliers. A single extreme value in the dataset can have a huge influence on where the cluster centroids end up, and can cause outliers to be given their own clusters instead of being ignored.

One way to avoid this issue is to remove the outliers from the dataset before clustering. But what if that isn't an option for you? What if there are just too many outliers in your dataset to simply remove them all? That's where K-Medians comes in.

K-Medians differs from K-Means in two ways:
First, K-Medians computes the dissimilarity between data points using the Manhattan distance instead of the Euclidean distance. More on the Manhattan distance later in this post.
Second, as the name suggests, K-Medians computes new cluster centroids using the median. The median is a robust measure of a dataset's center and is therefore less sensitive to outliers in the dataset.
To demonstrate this fact, let's suppose we have a small dataset of values:
1, 6, 9, 7, 12
The mean of the dataset is 7. If another data point with the value 7000 were added to the data, the resulting mean would be 1172.5, which is drastically different from the original.
If we computed the median of the dataset without and with the new data point, the results would be 7 and 8 respectively. The median of the dataset with the additional data point is only slightly higher than the median of the original dataset.
By using the median instead of the mean to calculate the new center of each cluster, K-Medians avoids the outlier sensitivity that K-Means suffers from.
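To see this in code, here's a quick sketch using Python's built-in statistics module, with the same toy values (and the same 7000 outlier) as above:

```python
from statistics import mean, median

data = [1, 6, 9, 7, 12]
print(mean(data), median(data))    # 7 7 -- mean and median agree

data_with_outlier = data + [7000]
print(mean(data_with_outlier))     # 1172.5 -- the mean is dragged far away
print(median(data_with_outlier))   # 8.0 -- the median barely moves
```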
About the Manhattan Distance
Another name for the Manhattan distance is the city block distance.

Imagine each cell is a building and the grid lines are sidewalks. If you wanted to travel from point A to point B in the picture, you could not take a straight path; both of the paths shown in the image require at least one turn. The Manhattan distance measures the distance walked along the grid.
The equation for the Manhattan distance is the following:

d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Allow me to break this formula down and explain how the distance is calculated.
Given two data points x and y, the formula adds up the absolute differences between the corresponding components of x and y. As an example, computing the Manhattan distance between the data points (3, 5, 9, 1) and (7, 2, 0, 8) yields the following:
|3 - 7| + |5 - 2| + |9 - 0| + |1 - 8| = 23
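If you'd like to verify this yourself, a tiny Python helper (just a sketch, not part of any particular library) does the job:

```python
def manhattan_distance(x, y):
    # Sum of the absolute differences between corresponding components.
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance((3, 5, 9, 1), (7, 2, 0, 8)))  # 4 + 3 + 9 + 7 = 23
```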
The Algorithm
Now that you're familiar with the Manhattan distance and the median, here's the breakdown of the K-Medians clustering algorithm (a short code sketch follows the list):
- Select k data points at random to be the initial center points.
- Use the Manhattan distance to calculate the dissimilarity between each data point and the center points.
- Assign each data point to the cluster whose center point is the smallest Manhattan distance away.
- Use the median to compute a new center point for each cluster.
- Repeat steps 2, 3, and 4 until there is no change in the cluster assignments.
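Here's a minimal NumPy sketch of those steps. It assumes X is an array of shape (n_samples, n_features); the function name, the seed parameter, and the fixed iteration cap are my own choices for illustration, not part of any standard implementation:

```python
import numpy as np

def k_medians(X, k, max_iter=100, seed=0):
    """A minimal K-Medians sketch: Manhattan distances, median-based centers."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial center points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign each point to the center with the smallest Manhattan distance.
        distances = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 5: stop once the cluster assignments no longer change.
        labels = new_labels
        # Step 4: recompute each center as the component-wise median of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return labels, centers
```

Calling labels, centers = k_medians(X, k=3) would return each point's cluster assignment along with the final median-based centers.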
Choosing K
As with K-Means, the Silhouette method can be used to find the optimal value for k. A modified version of the elbow method can also be used: instead of using the sum of squared errors to calculate the total distance between each data point and its centroid, the sum of absolute differences is used:

total error = Σj Σ(pi ∈ Cj) |pi − µj|

For each cluster Cj, the absolute (Manhattan) error |pi − µj| between each data point pi and the centroid µj is computed. These errors are then summed across all clusters to get the total error for the clustering.
From there the process of producing the elbow plot is the same as previously described in my earlier posts.
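As a rough sketch of how that error could be computed in code (reusing the hypothetical k_medians function from the previous section), the total absolute error is just the sum of Manhattan distances from every point to its assigned center:

```python
import numpy as np

def total_absolute_error(X, labels, centers):
    # Sum the Manhattan distance from each point to the center of its cluster.
    return sum(np.abs(X[labels == j] - centers[j]).sum() for j in range(len(centers)))

# Hypothetical elbow loop: compute the error for a range of k values, then plot them.
# errors = [total_absolute_error(X, *k_medians(X, k)) for k in range(1, 11)]
```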
What's Next
That's it, folks. In my next post, I'll cover yet another variation of k-means that is also robust to outliers. Until next time.