Medoids Data Clustering with CLARANS

K-medoids is a data clustering algorithm that is commonly used on datasets containing extreme values, since medoids are less sensitive to outliers than means. In previous posts on the subject, I covered two methods that can be used to find the medoids in a dataset.

The first method, called PAM, exhaustively scans every data point in the dataset when searching for the k medoids. PAM can accurately locate the medoids, but at the expense of efficiency on large datasets.

CLARA was the second method I covered. It's able to achieve superior performance on large datasets by restricting the scan to a smaller randomly sampled portion of the dataset. An issue that may arise with CLARA, however, is that the true medoids of the dataset might get excluded from the sample, causing the method to return sub-optimal results.

This is where the next algorithm I will discuss comes into play. Like CLARA, this algorithm does not check every data point in the dataset. But unlike CLARA, its search is not restricted to a small sample of data points; instead, it examines random neighborhoods of points selected over multiple iterations. This method is called CLARANS, which is short for Clustering LARge Applications based on RANdomized Search.

The Search for the Best Medoids

The CLARANS clustering algorithm starts out just like any other k-clustering method. K data points are chosen at random to be the medoids that all the other data points in the dataset will cluster around. From there, CLARANS searches for neighboring sets of medoids that cluster the data better. A better clustering is defined as a set of medoids where the total dissimilarity between each medoid and the data points assigned to it is as small as possible.
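To make "total dissimilarity" concrete, here is a minimal sketch of how it might be computed. The function names and the choice of Euclidean distance are my own assumptions for illustration; CLARANS itself works with any dissimilarity measure.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points. CLARANS works with any
    # dissimilarity measure; Euclidean distance is just a common choice.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def total_dissimilarity(medoids, data):
    # Sum, over every data point, of its distance to the nearest medoid.
    # A smaller total means a better clustering.
    return sum(min(euclidean(p, m) for m in medoids) for p in data)
```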

To make the process CLARANS uses to search for better medoids easier to understand, imagine that you are standing in a large room littered with dozens of water-filled holes. You and your friends are playing a game where you are tasked with finding the deepest hole in the room. The catch is that this has to be done blindfolded.

Each hole in the room represents one possible set of k medoids from the dataset.

After your friends place you at one of the holes, you begin your task of looking for the deepest hole in the room. Because you're unable to determine this visually, you do so by sticking your bare foot into the holes and measuring the depth by how high the water rises up your leg. The depth of a hole represents the total cluster dissimilarity for that set of k medoids: the deeper the hole, the lower the dissimilarity, and the better the clustering.

You make a mental note of the depth of the hole that's in front of you and then randomly choose one of the nearby holes to measure next.

In the case of the CLARANS clustering algorithm, two sets of k medoids are neighbors if they share all but one medoid.

Neighboring set of k-medoids, where k = 7

CLARANS selects a neighbor by randomly choosing one of the current k medoids and replacing it with a randomly selected data point that is not already a medoid. From there it computes the total cluster dissimilarity of the neighbor and compares it with the dissimilarity of the current set of medoids.
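Here is a minimal sketch of that swap, under the same assumptions as the earlier snippet (data points stored as tuples); random_neighbor is a hypothetical name I'm using for illustration:

```python
import random

def random_neighbor(medoids, data):
    # A neighbor shares all but one medoid with the current set: swap one
    # randomly chosen medoid for a randomly chosen non-medoid point.
    neighbor = list(medoids)
    swap_index = random.randrange(len(neighbor))
    candidates = [p for p in data if p not in medoids]
    neighbor[swap_index] = random.choice(candidates)
    return neighbor
```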

CLARANS searches several neighbors, keeping track of the best set of medoids it finds. The maximum number of neighbors to search is predetermined at the beginning of the algorithm; we'll call this parameter maxneighbors. Each time CLARANS finds a better set of medoids, it resets its count and searches up to maxneighbors more neighbors to see if it can find an even better set. If it fails to find a better clustering after examining maxneighbors additional neighbors, the search concludes. The set of medoids CLARANS settles on is known as a local minimum.

The flow chart below summarizes the local search described above:

CLARANS data clustering local search flow diagram
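In code, that local search might look something like the following sketch, reusing the helpers from the earlier snippets. The counter reset mirrors the "search maxneighbors more neighbors after each improvement" behavior described above.

```python
def local_search(data, k, maxneighbors):
    # Start from k medoids chosen at random.
    current = random.sample(data, k)
    current_cost = total_dissimilarity(current, data)
    failures = 0
    while failures < maxneighbors:
        neighbor = random_neighbor(current, data)
        neighbor_cost = total_dissimilarity(neighbor, data)
        if neighbor_cost < current_cost:
            # Found a better clustering: move to it and restart the count.
            current, current_cost = neighbor, neighbor_cost
            failures = 0
        else:
            failures += 1
    # maxneighbors consecutive neighbors failed to improve on the current
    # set of medoids, so it is a local minimum.
    return current, current_cost
```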

From Local Search to Global Search

Going back to our toy scenario, after searching several neighboring holes, you were able to find the deepest hole within the small region of holes that you searched. That hole, however, may not be the deepest hole in the room.

To find out, you decide to repeat your local neighborhood search starting from other random holes in the room, comparing the deepest hole found in each search against the deepest hole found so far. With that, your friends proceed to move you to other random holes in the room where you can conduct additional searches for local minima.

Translating this process to the CLARANS algorithm, the number of local searches to perform is another parameter determined at the start of the algorithm. We'll call this number numlocal. After performing several of these local searches, CLARANS has a pretty good approximation of the true medoids in the dataset.

The entire CLARANS data clustering process can be summarized with the following steps:

  1. Specify parameters numlocal, k, and maxneighbors.
  2. Let B represent the best k medoids discovered so far. Initially B is empty, with a total dissimilarity of infinity.
  3. Let f represent the number of local searches performed so far. Set this number to 1.
  4. Let H represent k randomly selected medoids from the dataset.
  5. Let c represent the number of neighbors inspected. Set this number to 1.
  6. Let s represent a random medoid from H. Let j represent a random data point from the dataset that is not in H. Let N equal a set of k medoids that shares the same points as H, except s is replaced by j.
  7. If the total cluster dissimilarity of N is less than that of H, set H to N and go to step 5.
  8. Increment c by 1. If c ≤ maxneighbors, go back to step 6.
  9. If the total dissimilarity of H is less than that of B, set B to H.
  10. Increment f by 1. If f ≤ numlocal, go back to step 4.
  11. Output B.
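Tying the steps together, here is a minimal end-to-end sketch built from the earlier helpers. Parameter names mirror the steps above; the example data at the bottom is made up purely for illustration.

```python
def clarans(data, k, numlocal, maxneighbors):
    # Run the local search numlocal times from fresh random starting
    # medoids and keep the best local minimum found overall (B above).
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        medoids, cost = local_search(data, k, maxneighbors)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best

# Example: two obvious clusters in the plane, clustered with k = 2.
points = [(1.0, 2.0), (1.5, 1.8), (0.9, 2.1),
          (8.0, 8.0), (8.2, 7.8), (7.9, 8.1)]
print(clarans(points, k=2, numlocal=4, maxneighbors=10))
```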

That's all, folks!

In the next post, we will be covering a method that can be used to cluster datasets with a mixture of numeric and categorical data. Stay tuned.