Faster Medoids Data Clustering with CLARA
Previously I introduced K-Medoids clustering, a data clustering method that can be used on datasets containing outliers. The Partitioning Around Medoids (PAM) algorithm I described in that post is computationally slow and expensive, and for that reason it is typically used only on small datasets. What if we have a very large dataset that we'd like to cluster using K-Medoids? Do we have no choice but to sit around and wait for the time-consuming clustering method to finish? Not at all! In this post I will introduce you to CLARA, a K-Medoids implementation that can be used on large datasets. Let's get started!
CLARA Data Clustering: PAM, But with a Twist
CLARA is short for Clustering LARge Applications. The PAM algorithm computes each medoid by finding the data point with the smallest total distance to all the other points in its cluster. Computing medoids this way is very quick for small datasets, but for much larger datasets it can take quite a bit of time.
CLARA addresses this issue by performing PAM on a randomly selected sample of data points. We will call the size of this smaller dataset s. The idea behind this approach is that computing the medoids of the sample gives a good approximation of the medoids we would get by clustering the entire dataset, assuming the sampled data points are representative of the whole.
This clustering of sampled data points is performed not once, but several times. To ensure that the best possible clustering is achieved, CLARA draws several random samples of data points, performs PAM on each, and keeps the clustering that produces the best result. We will call the number of random samples (and therefore the number of clusterings performed) n.
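To make the sampling step concrete, here is a minimal sketch of how the n samples might be drawn, assuming the dataset lives in a NumPy array named `data` (the array name, seed, and sizes here are illustrative values of mine, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n, s = 5, 100  # number of samples and sample size (illustrative values)

# Each sample is s rows drawn from `data` without replacement.
samples = [data[rng.choice(len(data), size=s, replace=False)]
           for _ in range(n)]
```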
How does CLARA determine which clustering result is the best, you ask?
That all depends on the dissimilarity (i.e. distance) measure that was used to form the clusters. If Euclidean distance was used, then the best result is the one with the smallest sum of squared error. If Manhattan distance was used, then the total absolute difference is used instead. See my articles on K-Means and K-Medians for more information about these measures.
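As a sketch of that comparison, the cost of a candidate set of medoids over the full dataset might be computed like this (`clustering_cost` is a name I'm introducing for illustration; it charges each point either its squared Euclidean error or its absolute difference to the nearest medoid):

```python
import numpy as np

def clustering_cost(data, medoids, metric="euclidean"):
    """Cost of a candidate clustering over the full dataset (sketch).

    Each point is charged its dissimilarity to the nearest medoid:
    squared error for Euclidean, absolute difference for Manhattan.
    """
    diff = data[:, None, :] - medoids[None, :, :]  # (n_points, k, dims)
    if metric == "euclidean":
        dists = (diff ** 2).sum(axis=2)            # squared error
    else:  # "manhattan"
        dists = np.abs(diff).sum(axis=2)           # absolute difference
    return dists.min(axis=1).sum()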
The medoids from the best clustering are then used to assign every data point in the entire dataset to the appropriate cluster.
[Figure: Step 1: Sample data points]
With that, the entire process is outlined in the following steps (a code sketch of the full loop follows the list):
- Select n random samples of the dataset, each of size s.
- Perform PAM on each sample.
- Determine the PAM clustering that produces the best result.
- Use the medoids from the selected clustering to assign every data point in the entire dataset to its nearest medoid.
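Putting the steps together, here is a hedged sketch of the full CLARA loop. It assumes a `pam(sample, k)` helper standing in for the PAM implementation from my previous post (the name and signature are mine, not a library's), and it reuses the `clustering_cost` function sketched above:

```python
import numpy as np

def clara(data, k, n_samples=5, sample_size=100, seed=None):
    """Sketch of CLARA: run PAM on several random samples and keep the
    medoids that score best over the ENTIRE dataset."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None

    for _ in range(n_samples):
        # Step 1: draw a random sample of size s.
        sample = data[rng.choice(len(data), size=sample_size, replace=False)]
        # Step 2: cluster the sample with PAM (hypothetical helper).
        medoids = pam(sample, k)
        # Step 3: score the candidate medoids against the full dataset.
        cost = clustering_cost(data, medoids)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids

    # Step 4: assign every point to its nearest medoid.
    diff = data[:, None, :] - best_medoids[None, :, :]
    labels = (diff ** 2).sum(axis=2).argmin(axis=1)
    return best_medoids, labels
```

From here, the speed/accuracy tradeoff discussed below comes down to the `sample_size` and `n_samples` arguments.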
With CLARA, then, we have an approach for efficiently computing approximate medoids for our dataset. There are, however, some disadvantages to this approach:
The quality of the results returned by CLARA depends heavily on the data sampled. If there is bias in the selection of the data points, the clustering results will suffer.
The effectiveness of CLARA also depends on the sample size. CLARA can't produce a good clustering if any of the sampled medoids is far from the actual medoids, which will almost certainly be the case if the samples are too small. On the other hand, samples that are too large result in longer processing times. Therefore, when using CLARA one must choose a sample size that provides an acceptable tradeoff between speed and accuracy (the original CLARA authors suggest, for example, drawing 5 samples of size 40 + 2k).
Coming up next
My next post will cover the CLARANS data clustering algorithm, another K-Medoids implementation. Until next time.