Weighing your k-means data clustering
In my previous post on the K-Prototypes algorithm, I introduced the concept of parameter weighting. With just a single parameter, you can control how much influence the categorical attributes have on the clustering. In this post, I'm going to expand on this concept and show how we can control the weighting of entire data points. There are some data clustering problems you will encounter in the real world where this technique is essential.
Customer Segmentation Case Study
Consider the following scenario. You have been hired by a growing online distributor of flavored milk to help them understand the shopping habits of their customers. The company would like to segment their customers into different groups. In particular, the company would like to understand the representative spend share for each customer group so that it can effectively firm up its marketing strategy. You are handed a dataset containing the percentage of sales in five different milk products over the past year.
CustomerID | Organic Milk | Condensed Milk | Almond Milk | Coconut Milk | Skim Milk |
1 | 13.45 | 26.05 | 11.14 | 4.99 | 27.44 |
2 | 24.73 | 19.52 | 16.58 | 3.49 | 23.65 |
3 | 25.50 | 20.47 | 18.14 | 2.66 | 23.49 |
4 | 4.50 | 20.10 | 7.36 | 9.27 | 13.47 |
5 | 26.47 | 21.09 | 18.40 | 1.90 | 21.75 |
6 | 9.43 | 15.37 | 11.89 | 1.78 | 25.81 |
7 | 4.83 | 21.11 | 5.48 | 9.47 | 13.80 |
8 | 5.76 | 18.09 | 7.42 | 9.91 | 11.25 |
9 | 5.36 | 18.27 | 4.84 | 9.97 | 13.01 |
10 | 7.99 | 20.17 | 6.63 | 7.90 | 12.39 |
… | … | … | … | … | … |
Customer segmentation problems such as this one are an excellent use case for k-means clustering. The centroids (i.e. the center points) produced by k-means can provide the representative customer characteristics that your client is looking for. Taking a look at the distribution of total spend for each customer, you notice that the numbers are heavily skewed.

Some customers have a high spend share in one or more categories, but their total spend is so low that they are less likely to respond to the company's marketing campaign. Because k-means weighs each data point equally, it will not take a customer's total spend into account and will produce clusters and centroids that are misleading.
In order to get the output we're looking for, we need to weigh each customer differently. Customers who have a higher total spend should have more influence on the cluster centroids than customers who do not. This is exactly what weighted k-means clustering allows us to do.
About Weighted K-means clustering
To perform weighted k-means clustering, we need to make a minor tweak to the way the cluster centroids are calculated after each iteration. Instead of using the mean to calculate the centroids, we use the weighted mean:

c = (Σ wi xi) / (Σ wi), where the sums run over the data points assigned to the cluster.

Each data point xi in the dataset is assigned a weight wi. Computing the weighted mean for a cluster is just a matter of:
- Multiplying each data point by its weight
- Summing the products calculated in step 1, and
- Dividing the sum calculated in step 2 by the sum of all the weights.
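The three steps above can be sketched in a few lines of NumPy (the numbers are made up for illustration); NumPy's np.average computes the same weighted mean in a single call:

```python
import numpy as np

# Three customers in one hypothetical cluster (two spend-share features)
points = np.array([[10.0, 20.0],
                   [30.0, 40.0],
                   [50.0, 60.0]])
# Illustrative weights, e.g. each customer's total spend
weights = np.array([1.0, 2.0, 3.0])

# Step 1: multiply each data point by its weight
products = points * weights[:, None]
# Step 2: sum the products
weighted_sum = products.sum(axis=0)
# Step 3: divide by the sum of the weights
centroid = weighted_sum / weights.sum()

# np.average performs the same computation in one call
assert np.allclose(centroid, np.average(points, axis=0, weights=weights))
```

Note how the heavily weighted third customer pulls the centroid toward it, compared with the plain mean of [30, 40].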
As usual, here's the weighted k-means clustering procedure, taking into account the new centroid calculation discussed above:
- Choose k. Determine the number of clusters to form
- Select initial centroids. In this step the algorithm chooses k data points from the dataset at random to be the initial centroids.
- Assign data points to a centroid. In this step the algorithm calculates the Euclidean distance between each data point and each centroid. Data points are assigned to the centroid with the smallest distance.
- Calculate weighted centroids. New centroids are calculated by finding the weighted mean of each cluster.
- Repeat steps 3 and 4. Repeat steps 3 and 4 until no changes in the assignments are made.
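The full procedure can be sketched as a minimal NumPy implementation. This is a sketch, not production code: it assumes no cluster ever ends up empty, which a robust version would need to handle.

```python
import numpy as np

def weighted_kmeans(X, w, k, max_iter=100, seed=0):
    """Weighted k-means sketch: standard k-means, except each new
    centroid is the weighted mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    # Step 2: select k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to the centroid with the smallest
        # Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the weighted mean of its
        # cluster (assumes every cluster is non-empty)
        new_centroids = np.array([
            np.average(X[labels == j], axis=0, weights=w[labels == j])
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

With all weights equal to 1 this reduces to ordinary k-means; increasing a point's weight pulls its cluster's centroid toward it.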
Weighted K-means and Outliers
Weighted k-means clustering is also useful when clustering datasets with outliers. By assigning small weight values to outlier data points, k-means will form clusters that are robust to these extreme values.
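If you use scikit-learn, you don't need to implement any of this yourself: KMeans.fit accepts a sample_weight argument. A quick sketch with made-up data, two tight blobs plus one extreme outlier that is given a near-zero weight:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two tight blobs plus one extreme outlier (made-up data)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(10, 0.5, size=(20, 2)),
               [[100.0, 100.0]]])
# Give the outlier a near-zero weight so it barely pulls the centroids
w = np.ones(len(X))
w[-1] = 1e-6

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X, sample_weight=w)
# The fitted centroids land near the two blob centers, not near the outlier
```

With all weights set to 1 instead, the outlier would drag one centroid far away from its blob.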
That's all folks!
I hope you found this guide on weighted k-means useful. In my next post, I will discuss data scaling, an important but often overlooked concept that has a huge impact on data clustering algorithms.