K-Means: The Simple Clustering Algorithm

If you do a quick Google search for clustering algorithms, k-means will almost certainly be the first algorithm mentioned in the results, and for good reason: it's the first algorithm most practitioners turn to when segmenting a dataset. Compared to other clustering methods out there, k-means is very simple to both implement and understand. In this post I will explain how the algorithm works and cover some pros and cons you should be aware of when using it. In future posts, I will delve into more technical aspects of k-means and walk through a coded implementation of the algorithm.

Here goes…

So what exactly is k-means?

The k-means algorithm divides a dataset into groups in which the data points are as close to one another as possible. The k in k-means refers to the number of clusters you choose to create. When using this algorithm, the practitioner provides two things:

  1. A dataset, and
  2. The number of clusters to create. We call this number k.

The algorithm will then take the dataset and create the desired number of clusters. The means in k-means comes from the way the clusters are formed: the algorithm repeatedly estimates the center of each cluster by calculating the mean (average) of the data points in that cluster. Another name for this center point is centroid. Each data point in the dataset is assigned to the cluster whose centroid it is closest to. This will make more sense once we go through the next section.
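
If you just want to see those two inputs in action, here is a minimal sketch using scikit-learn's KMeans (the toy dataset and the choice of k = 2 are made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# A toy dataset: six 2-D points forming two loose groups
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # points near (1, 2)
    [8.0, 8.0], [8.5, 7.8], [7.9, 8.3],   # points near (8, 8)
])

# The two inputs the practitioner provides: the dataset and k
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # the cluster label assigned to each point
print(kmeans.cluster_centers_)  # the final centroid (mean) of each cluster
```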

[Figure: the same dataset before and after k-means clustering]

So how exactly does it work?

Here's a breakdown of the clustering algorithm in all of its parts.

[Diagram of the k-means loop: choose initial centroids, assign points to clusters, calculate new centroids, repeat until done]
  1. Choose k. The first step is to determine the number of clusters you'd like to create. We'll call this number k.
  2. Select initial cluster centroids. In this step the algorithm chooses k data points from the dataset at random. These data points serve as center points (centroids) for the clusters.
  3. Assign data points to a cluster.  In this step the algorithm calculates the distance between each data point in the dataset and each centroid and assigns the data point to the centroid it is closest to.
  4. Calculate new cluster centroids. The algorithm calculates the new center point for each cluster by computing the average of all the data points assigned to it.
  5. Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new centroids. After the reassignment, new cluster centroids are computed. This process is repeated until the assignments no longer change (a bare-bones sketch of this loop is shown below).
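
To make the loop above concrete, here is a bare-bones NumPy sketch of the same five steps. It is meant purely as an illustration, not a production implementation (it does not handle empty clusters, for example):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)

    # Step 2: pick k data points at random to serve as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: stop once the centroids (and therefore the assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```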

But wait, how is distance between a data point and a centroid calculated?

There are many methods that can be used to calculate the distance (i.e. similarity) between points, many of which I plan on covering at a later time, but the one most commonly used is Euclidean distance. For two points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn), the formula is as follows:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
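
In code, that distance is only a couple of lines (NumPy is used here just for convenience):

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points of any dimension."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0 (the classic 3-4-5 right triangle)
```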

So, what are the advantages of k-means?

Very simple to implement. This is probably one of the biggest advantages of this algorithm. Not only is it very easy to implement, it is also very easy to interpret and visualize compared to other algorithms out there.

Very efficient. The k-means clustering algorithm is highly scalable and can efficiently cluster large datasets.

Highly adaptable to new data points. The k-means clustering algorithm is extremely flexible and can adjust the clustering as new data points are added to the dataset. This is really great for real-time applications where new data points arrive on the fly.
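
As a small illustration, once a model has been fit, scikit-learn's predict method assigns any new point to its nearest existing centroid (refitting the model would be needed if you wanted the centroids themselves to shift):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# New points that arrive later are simply assigned to the nearest existing centroid
new_points = np.array([[1.1, 2.1], [7.5, 8.2]])
print(kmeans.predict(new_points))
```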

And, what are the disadvantages of k-means?

K must be chosen manually. As you may have already surmised from the explanation of the algorithm, k-means is not capable of determining the optimal number of clusters in a dataset automatically; you must specify the number of clusters in advance. In a future post I will cover some strategies you can use to determine the optimal number of clusters for k-means.

Very sensitive to noisy data and outliers. The presence of extreme data points in the dataset can have a huge impact on the quality of the clusters that are formed. Cluster centroids, for instance, can be pulled out of place by outliers, and in some cases an outlier may even end up with a cluster of its own instead of being ignored. To prevent this, make sure to remove outliers from the dataset before clustering.

[Figure: a dataset with outliers, before and after clustering]
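
One simple way to do that outlier removal is a z-score filter applied before fitting; the 3-standard-deviation cutoff below is just a common rule of thumb, not something k-means itself requires:

```python
import numpy as np
from sklearn.cluster import KMeans

def drop_outliers(X, z_thresh=3.0):
    """Keep only rows whose features all lie within z_thresh standard deviations of the mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_thresh).all(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[15.0, 15.0]]])  # 200 ordinary points plus one extreme outlier

X_clean = drop_outliers(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_clean)
```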

Results are inconsistent. Because the initial center points are chosen at random, you can run k-means multiple times on the same dataset and get different answers. What people therefore do in practice is run k-means several times with different initial values and keep the run with the best result.
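
scikit-learn's n_init parameter does exactly this for you, but the idea is simple enough to sketch by hand: run k-means with several different random initializations and keep the run with the lowest inertia (the within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(300, 2))

best = None
for seed in range(10):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    # inertia_ is the within-cluster sum of squared distances; lower is better
    if best is None or km.inertia_ < best.inertia_:
        best = km

print(best.inertia_)
```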

Works best with spherical datasets. One big assumption k-means makes about a dataset is that it contains spherically shaped clusters. This is not always the case, however. When given a dataset such as the one below, k-means may produce a result that isn't what you'd expect.

[Figure: a non-spherical dataset, before and after clustering]
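
If you want to try this for yourself, scikit-learn's two-moons dataset is a handy test case: it clearly contains two clusters, but they are crescent-shaped rather than spherical, and k-means will typically split the points straight across the crescents rather than along them:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: clearly two groups, but not spherical ones
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Plot labels against true_labels to see where the k-means assignment disagrees
```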

Assumes clusters have similar density. Another big assumption that k-means makes about a dataset is that the clusters are similar in size and density. When given a dataset with an uneven distribution of points per cluster, the clustering starts to break down.

What's next?

You now have a high-level understanding of k-means clustering and how it works. Over the next few posts, I will start diving into more technical details of the algorithm, including strategies for choosing the optimal number of clusters and other distance metrics you can use with k-means.