Outlier Clustering with K-Medoids

In my last post, I introduced you to the K-Medians algorithm, a data clustering method that can be used to cluster datasets with outliers. In this post I will be introducing you to another k-means variation that is also effective on datasets with outliers. It's called K-Medoids. As a matter of fact, there are three variations of this clustering method, each with its own advantages and disadvantages. I will be covering just one of these variations in this post. The other two will be covered in subsequent posts.

Introducing the K-Medoids Algorithm

The K-Medoids algorithm, much like the other variations we've covered thus far, works similarly to k-means. What makes it different is:

  • Unlike the other variations we've covered up to this point, any distance function can be used to compute dissimilarities between the data points and the cluster centroids. That means you can use, for example, Euclidean distance or Manhattan distance. I will be covering more distance functions you can use for clustering in a future post.
  • Centroids are computed by finding the medoid of the data points in each cluster. A medoid is a data point whose average dissimilarity to all the other points in the cluster is the smallest. In other words, the medoid is the most centrally located point in the cluster. The medoid, like the median, is not influenced by extreme data points, so K-Medoids does not suffer from the outlier sensitivity that limits k-means. (A minimal sketch of the idea follows this list.)
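
To make the medoid idea concrete, here is a minimal Python sketch. The helper names (manhattan, euclidean, medoid) are my own for illustration, not a library API; any distance function with the same two-argument signature would work:

```python
import numpy as np

def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    """Euclidean (L2) distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

def medoid(points, dist=euclidean):
    """Return the point whose average dissimilarity to all the
    other points is smallest, i.e. the most central point."""
    avg_dists = [np.mean([dist(p, q) for q in points]) for p in points]
    return points[np.argmin(avg_dists)]

# The outlier at (100, 100) would drag a mean far off-center,
# but the medoid is always one of the actual data points.
data = np.array([[1, 2], [2, 1], [2, 3], [100, 100]])
print(medoid(data))  # -> [2 3]
```

Notice that, unlike a mean or even a coordinate-wise median, the medoid is guaranteed to be an actual member of the dataset.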

Partitioning Around Medoids

As I mentioned at the start of this post, there are three variations of K-Medoids clustering. The most common approach, and the one I will be explaining in this post, is known as the Partitioning Around Medoids (PAM) algorithm. The algorithm works as follows:

  1. Initialize. Randomly select k data points as centroids.
  2. Assign. Pick a distance function and use it to compute the dissimilarity between each data point and the k centroids. Assign each data point to the closest centroid.
  3. Update. Perform the following for each cluster:
    • Let c represent the cluster centroid, and p represent a non-centroid data point. For each non-centroid data point p:
      1. Swap c and p. That is, treat p as the candidate centroid for the cluster.
      2. Use the distance function you chose in step 2 to compute the average distance between the candidate centroid and every other data point in the cluster.

After computing an average distance for each candidate, the new centroid becomes the data point with the smallest average distance. Steps 2 and 3 are then repeated until the centroids stop changing. A minimal sketch of the complete loop follows.
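
Putting the steps together, here is a minimal, unoptimized Python sketch of the loop described above. The pam function and its parameters are illustrative names of my own, not a library API, and the tiny dataset at the end is made up for the demo:

```python
import random
import numpy as np

def euclidean(a, b):
    """Euclidean distance (same helper as in the earlier sketch)."""
    return np.sqrt(np.sum((a - b) ** 2))

def pam(X, k, dist=euclidean, max_iter=100, seed=0):
    """Minimal sketch of Partitioning Around Medoids, following the
    steps above. Returns (medoid indices, cluster label per point)."""
    rng = random.Random(seed)
    n = len(X)
    medoids = rng.sample(range(n), k)                 # 1. Initialize
    for _ in range(max_iter):
        # 2. Assign: each point joins the cluster of its nearest medoid.
        labels = np.array([min(range(k),
                               key=lambda j: dist(X[i], X[medoids[j]]))
                           for i in range(n)])
        # 3. Update: within each cluster, try each member as the
        # centroid and keep the one whose average distance to the
        # rest of the cluster is smallest.
        new_medoids = list(medoids)
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:   # leave an empty cluster's medoid alone
                continue
            costs = [np.mean([dist(X[a], X[b]) for b in members])
                     for a in members]
            new_medoids[j] = int(members[np.argmin(costs)])
        if new_medoids == medoids:                    # converged
            break
        medoids = new_medoids
    return medoids, labels

# Two tight clusters plus one extreme outlier.
X = np.array([[1, 2], [2, 1], [2, 3], [8, 8], [8, 9], [9, 8], [100, 100]])
medoids, labels = pam(X, k=2)
print(X[medoids], labels)
```

Because each new centroid is re-selected from the cluster's own members, the outlier at (100, 100) cannot drag a centroid away from the rest of its cluster the way it would drag a k-means average.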

Choosing the Optimal k

The Elbow method and the Silhouette method can be used to find the optimal value of k for K-Medoids. See this post for a description of these methods; a short silhouette-based sketch follows.
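
As one hedged example of the Silhouette method, the sketch below reuses the pam function and dataset X from the previous snippet and scores a few candidate values of k with scikit-learn's silhouette_score; the candidate range is an arbitrary choice for illustration:

```python
from sklearn.metrics import silhouette_score

# Reuses pam() and X from the sketch above. The silhouette score
# ranges from -1 to 1; values closer to 1 indicate better-separated,
# more cohesive clusters.
scores = {}
for k in range(2, 6):  # arbitrary candidate range for the demo
    _, labels = pam(X, k)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```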

Up Next...

The PAM algorithm is able to accurately compute the medoids for each cluster, but this accuracy comes at a cost: the algorithm is computationally expensive, and it is therefore best used when the dataset is relatively small. For much larger datasets we need more efficient methods. I will be discussing one of these methods in my next post. Until next time.