Let’s Get Fuzzy with C Means Clustering

Not every dataset you encounter in the wild will have data points that can be neatly placed in a single cluster. You may come across data points that could plausibly belong to two, three, or even more clusters.

Consider this scenario. You work as a data analyst for a company that has developed a fitness accountability app. During the first week of January, the company experienced a large number of new app installs, presumably from people looking to fulfill their New Year's resolution fitness goals. At the end of the month, you gathered a week's worth of activity data from several of these new users. Below is a scatter plot of this data.

Data points in red represent users who deleted the app after 1 week of usage. The data points in green are users who are still actively using the app. From the plot it is clear that there's a relationship between level of app engagement and likelihood to churn. Users furthest to the left are more likely to delete the app after a week, while users furthest to the right are more likely to remain active on the app. A high degree of uncertainty, however, exists for users in the middle of the plot. These users could either churn or remain active. 

Let's now suppose you were given the task of grouping these users into "churn" and "not churn" categories. You could use a clustering method like k-means to group the users, but information about the degree of uncertainty in the assignments would be lost. What's needed is a clustering method that models the degree to which a data point resembles the users who churn and the users who remain active.

Enter Fuzzy C Means Clustering

C-means is a variation of k-means that assigns each data point a probability of belonging to each cluster. It differs from k-means in two key ways:

The first difference is that c-means constructs a table (i.e., a matrix) containing the probability that each data point belongs to each cluster. Here's what the table would look like for a few data points in the problem outlined above:

| User | Churn | Not Churn |
|------|-------|-----------|
| A    | 0.25  | 0.75      |
| B    | 0.91  | 0.09      |
| C    | 0.51  | 0.49      |
| D    | 0.17  | 0.83      |

Each entry in the table is a number between 0 and 1. The larger the number, the more likely it is that the data point belongs to that cluster. The probabilities for each data point must also sum to 1; if you look at each row of the table above, you will see that they indeed do.
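
To make these constraints concrete, here's a small NumPy sketch (the values simply mirror the hypothetical table above):

```python
import numpy as np

# Fuzzy partition table from above: one row per user,
# one column per cluster ("churn", "not churn").
U = np.array([
    [0.25, 0.75],  # user A
    [0.91, 0.09],  # user B
    [0.51, 0.49],  # user C
    [0.17, 0.83],  # user D
])

# Every membership lies in [0, 1] ...
assert ((U >= 0) & (U <= 1)).all()
# ... and each row sums to 1.
assert np.allclose(U.sum(axis=1), 1.0)
```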

This table, which we will call the fuzzy partition table, is randomly initialized at the start of the c-means algorithm and then repeatedly updated, along with the centroids. Each entry in the table is updated using the following equation (the standard c-means membership update):

$$u_{ij} = \frac{\left(1 / \mathrm{dist}(x_i, c_j)\right)^{2/(m-1)}}{\sum_{q=1}^{c} \left(1 / \mathrm{dist}(x_i, c_q)\right)^{2/(m-1)}}$$

$\mathrm{dist}(x_i, c_j)$ and $\mathrm{dist}(x_i, c_q)$ represent the Euclidean distance between a data point $x_i$ and the cluster centroids $c_j$ and $c_q$. You may also have noticed the $m$ term in the exponents of the numerator and denominator. This is a special parameter, often called the fuzzifier, that controls the fuzziness of the results; it must be greater than 1. As $m$ approaches 1, the equation produces a nearly binary output (1 or 0), akin to what you'd get if k-means were applied. As a matter of fact, you can think of k-means as c-means in the limit as $m$ approaches 1. The larger $m$ is, the fuzzier the output.
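
As a quick illustration, here is a minimal NumPy sketch of this update, vectorized over all points and centroids (the function and variable names are mine):

```python
import numpy as np

def update_memberships(X, centroids, m=2.0):
    """Recompute the fuzzy partition table for points X (n x d)
    and centroids (c x d), using fuzzifier m > 1."""
    # Euclidean distance from every point to every centroid: shape (n, c)
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.fmax(dist, np.finfo(float).eps)  # avoid division by zero
    inv = (1.0 / dist) ** (2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1
```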

The second key difference lies in the calculation of the centroids on each iteration of the algorithm. The formula is as follows:

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{\,m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{\,m}}$$

If you read my previous post on weighted k-means, you'll notice that this is essentially a weighted mean: each centroid is the average of all the data points, weighted by their (exponentiated) memberships in that cluster.
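
In code, the centroid update is nearly a one-liner (again a sketch, with my own names):

```python
import numpy as np

def update_centroids(X, U, m=2.0):
    """Weighted mean of the points X (n x d), with weights taken
    from the fuzzy partition table U (n x c) raised to the power m."""
    W = U ** m                                  # (n, c) weights
    return (W.T @ X) / W.sum(axis=0)[:, None]   # (c, d) centroids
```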

With that, here are the steps of the c-means algorithm (a complete sketch in code follows the list):

  1. Choose c, the number of clusters, and m
  2. Randomly select c data points from the dataset to be initial centroids
  3. Randomly choose values for each entry in the fuzzy partition table, subject to the constraint that the values for each data point sum to 1.
  4. Compute new centroids
  5. Update the fuzzy partition table
  6. Repeat steps 4 and 5 until the values in the table stop changing, or until the change between iterations is smaller than some threshold.
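
Putting steps 1 through 6 together, here is a minimal, self-contained NumPy implementation. Treat it as a sketch rather than production code; names like `fuzzy_c_means` are my own:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Cluster points X (n x d) into c fuzzy clusters with fuzzifier m > 1.
    Returns (centroids, U), where U is the n x c fuzzy partition table."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Steps 1-3: pick c random points as initial centroids, and build a
    # random partition table whose rows sum to 1.
    centroids = X[rng.choice(n, size=c, replace=False)]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(max_iter):
        # Step 4: centroids are the membership-weighted means of the points.
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]

        # Step 5: recompute memberships from distances to the new centroids.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, np.finfo(float).eps)
        inv = (1.0 / dist) ** (2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        # Step 6: stop once the table changes by less than the threshold.
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new

    return centroids, U
```

Calling `fuzzy_c_means(X, c=2, m=2.0)` on the engagement data would return two centroids plus a membership table like the one shown earlier.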

Choosing the C in C-Means

For c-means you can use the elbow method to find the optimal value for c, just as you would with k-means. Below is a modified version of SSE to use when choosing c; it is the c-means objective, i.e., the membership-weighted sum of squared distances:

$$SSE_{fuzzy} = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{\,m} \, \mathrm{dist}(x_i, c_j)^2$$
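
Here is a minimal sketch of the elbow loop, reusing the `fuzzy_c_means` function from the sketch above (the random data is just a stand-in for the real engagement data):

```python
import numpy as np

def fuzzy_sse(X, centroids, U, m=2.0):
    """Membership-weighted sum of squared distances (the c-means objective)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return ((U ** m) * d2).sum()

# Run c-means for a range of c and look for the "elbow" in the curve.
X = np.random.default_rng(0).random((200, 2))  # stand-in data
for c in range(2, 9):
    centroids, U = fuzzy_c_means(X, c=c, m=2.0)
    print(c, round(fuzzy_sse(X, centroids, U, m=2.0), 3))
```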

That's all folks!

In my next post I will discuss methods for clustering high dimensional data.