High Dimensional Clustering 101
High dimensional data are datasets containing a large number of attributes, usually more than a dozen. There are a few things you should be aware of when clustering datasets such as these. The goal of this post is to highlight a few strategies you can use when performing high dimensional clustering.
Setting the Stage
To set the stage for the discussion take a look at the image below.

Which one of the red points would you say is the closest to the x in the center? As you can see, there's no clear answer. Some of you reading this may even say that it is a nonsensical question to be asking, because these points are roughly equidistant from x. Even if there is a single point that is closer than the others, it would only be closer by a measly, negligible amount.
When a data clustering algorithm looks at high dimensional data, the data points could look much like the above image. The results we get back from the algorithms may not capture the patterns present in the dataset. This is known in machine learning as the curse of dimensionality.
Dum Dum Duuuuuuuuumm… The curse of dimensionality
The curse of dimensionality is a very interesting and counterintuitive phenomenon that arises when machine learning algorithms are applied to datasets with a large number of features.
Data clustering algorithms work by computing distances between data points and grouping together points that are close to one another. When the number of features in a dataset is small, the algorithms are able to clearly separate the data points that are close together from the ones that are not. For datasets with many features, identifying the points that are closer together is much more challenging. The reason is that in high dimensions the distances between points become large, and this is true even for points that are similar to each other.
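One quick way to see this for yourself is to compare how far away a point's nearest and farthest neighbors are as the number of dimensions grows. Below is a minimal numpy sketch (the sample size and the list of dimensions are arbitrary choices): for random points, the ratio between the nearest and the farthest distance creeps toward 1 as dimensions are added, which is exactly the "everything looks equidistant" situation in the picture above.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1000, 10000]:
    # 200 random points in a unit hypercube of the given dimension
    points = rng.random((200, dim))

    # Distances from the first point to every other point
    distances = np.linalg.norm(points[1:] - points[0], axis=1)

    # As the dimension grows, the nearest and farthest points end up
    # almost the same distance away: the ratio creeps toward 1
    print(dim, round(distances.min() / distances.max(), 3))
```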
Pictured below are two charts. The chart on the left is a 1-dimensional plot of several data points. The chart on the right is a histogram of pairwise distances between each data point pictured in the first plot.


As shown in the plot on the left, the dataset can be neatly grouped into three clusters. You can also see this from the three peaks in the histogram on the right.
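If you'd like to reproduce a setup like this, here's a rough sketch assuming three well-separated Gaussian clusters in one dimension (the centers, spreads, and sample sizes are arbitrary choices, not necessarily the ones behind the plots):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Three well-separated 1-D Gaussian clusters centered at 0, 10, and 20
data = np.concatenate([
    rng.normal(0.0, 1.0, 50),
    rng.normal(10.0, 1.0, 50),
    rng.normal(20.0, 1.0, 50),
]).reshape(-1, 1)

# Histogram of all pairwise distances: peaks near 0 (within-cluster
# distances), 10 (neighboring clusters), and 20 (the two outer clusters)
plt.hist(pdist(data), bins=50)
plt.xlabel("pairwise distance")
plt.show()
```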
Here's what the charts look like when we add a second dimension to the data.


With the addition of the second dimension, the distances between the points are larger. Nevertheless, a clustering algorithm can still easily cluster these points into three groups.
The same observation can be made when a third dimension is added to the data.


As more dimensions are added to the dataset, the distances between the points become larger and the peaks we observed in the histogram become narrower. At 100,000 dimensions we get a histogram shaped like the one below:

Wait, aren't there supposed to be three peaks? At 100,000 dimensions, the distances between the data points have become so large, and so similar to one another, that the third cluster we saw in the first three plots is no longer distinguishable from the other two. If we were to run K-means on this dataset, the clusters it gave back would definitely not be what we'd expect.
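Here's a sketch of how you could check that claim yourself, along the lines of the experiment above. It assumes the cluster structure lives in a single coordinate and that every added dimension is pure noise with a spread a few times larger than the within-cluster spread; the sizes and scales are arbitrary, illustrative choices rather than the ones behind the plots above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_per_cluster = 50

# Three clusters separated along a single coordinate, as in the 1-D example
signal = np.concatenate([
    rng.normal(0.0, 1.0, n_per_cluster),
    rng.normal(10.0, 1.0, n_per_cluster),
    rng.normal(20.0, 1.0, n_per_cluster),
]).reshape(-1, 1)
true_labels = np.repeat([0, 1, 2], n_per_cluster)

# Cluster with and without 100,000 added noise dimensions
# (the noise scale of 3.0 is an arbitrary, illustrative choice)
for n_noise_dims in [0, 100_000]:
    noise = rng.normal(0.0, 3.0, (3 * n_per_cluster, n_noise_dims))
    X = np.hstack([signal, noise])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # adjusted_rand_score is 1.0 when the true clusters are recovered
    # perfectly and close to 0 for an essentially random grouping
    print(n_noise_dims, adjusted_rand_score(true_labels, labels))
```

With no noise dimensions the score comes back essentially perfect; once the noise dimensions are added it collapses, which is the curse of dimensionality showing up in the clustering result rather than just in a histogram.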
Now that you are aware of the curse of dimensionality, I will briefly highlight in the next few sections some methods that you can use to work around it.
Use Feature Selection
You don't need to cluster with every attribute in your dataset. If you're only interested in a handful of attributes, you can cluster on just those attributes and discard the others. You should also examine your dataset for features that show very little variation or are highly correlated with other features. Those features will not provide any meaningful information to a clustering algorithm and can therefore be discarded.
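Here's a minimal sketch of both checks using pandas and scikit-learn. The random stand-in DataFrame, the 0.01 variance threshold, and the 0.95 correlation cutoff are all assumptions you would replace with your own data and judgment.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Stand-in for your dataset: a DataFrame of numeric features
df = pd.DataFrame(np.random.default_rng(0).normal(size=(100, 20)),
                  columns=[f"f{i}" for i in range(20)])

# Drop features whose variance falls below an (arbitrary) threshold
selector = VarianceThreshold(threshold=0.01)
selector.fit(df)
df = df.loc[:, selector.get_support()]

# Drop one feature out of every pair whose absolute correlation exceeds 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```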
Use Dimensionality Reduction
Another approach for clustering high dimensional data entails projecting the dataset to a lower dimensional space before applying a clustering algorithm. Oftentimes a dataset can be approximated very well using a smaller number of dimensions without losing much information. This can be accomplished by applying techniques from linear algebra such as Principal Component Analysis and Singular Value Decomposition. I will explore both topics in more detail in future posts.
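As a quick illustration, here's what that pipeline might look like with scikit-learn's PCA. The random stand-in data, the 90% explained-variance cutoff, and the choice of three clusters are all assumptions you would swap out for your own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for your high dimensional dataset
X = np.random.default_rng(0).normal(size=(500, 200))

# Standardize the features, then keep enough principal components
# to explain 90% of the variance (an arbitrary cutoff)
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.9).fit_transform(X_scaled)

# Cluster in the lower dimensional space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(X.shape, "->", X_reduced.shape)
```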
Use a Graph-based Clustering Algorithm
Graph-based clustering algorithms such as spectral clustering use a different criterion for forming clusters, one that is more robust in high dimensions than K-means and other related methods. I will explore spectral clustering in more detail in a future post.
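Here's a minimal sketch with scikit-learn's SpectralClustering, which builds a nearest-neighbor graph over the points and clusters that graph instead of the raw coordinates. The stand-in data, the number of clusters, and the neighborhood size are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for your high dimensional dataset
X = np.random.default_rng(0).normal(size=(300, 50))

# Build a nearest-neighbor affinity graph and cluster it into 3 groups
model = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
```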
That's all folks!
In my next post, I'll explore the topic of dimensionality reduction further by discussing the nuts and bolts of Principal Component Analysis. Stay tuned.