Scaling your data before clustering

You will often need to perform some form of preprocessing on your dataset before running a data clustering algorithm. In this post I will introduce one important preprocessing method that is often overlooked when clustering data. That method, my friends, is called data scaling.

What is data scaling?

As you may already know, clustering algorithms work by computing distances (i.e. dissimilarities) between data points in a dataset and grouping together points that are close to each other. The exact distance measure varies from algorithm to algorithm, but most share one trait: they are highly sensitive to attributes (also called features) whose values are numerically larger than those of the other attributes in the dataset. These features bias the distance calculations and can cause the clustering algorithm to produce subpar clusterings.

To illustrate this, let's say you have a dataset containing SAT and GPA scores for 5000 college students that you'd like to segment into 4 groups.

Student    SAT     GPA
1          1419    1.35
2          1384    1.16
3          784     3.58
4          748     3.31
5          1124    0.98
...        ...     ...
4996       683     2.02
4997       739     3.58
4998       741     3.21
4999       1088    1.01
5000       595     1.60

After running K-means on the dataset, you create a scatter plot showing the student scores and the clusters each student is assigned to.

It's immediately apparent from the plot that a large number of students were mislabeled by the algorithm. SAT scores have a significantly larger numerical scale (400 to 1600) than grade point averages (0 to 4), so they have a much larger influence on the distance measurement K-means uses to group the students.

As can be seen from the plot, K-means leaned heavily on the SAT scores to cluster the data. To ensure that GPA and SAT carry equal weight in the clustering, both features need to be transformed so they are on the same scale. This is exactly what data scaling is for.
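To make that concrete, here's a quick sketch (just NumPy, using two students from the table above) showing how the SAT column dominates a Euclidean distance calculation:

    import numpy as np

    # Two students from the table above, as (SAT, GPA) points
    student_1 = np.array([1419, 1.35])
    student_3 = np.array([784, 3.58])

    # Squared differences per feature: the SAT term dwarfs the GPA term
    squared_diff = (student_1 - student_3) ** 2
    print(squared_diff)                  # roughly [403225.0, 4.97]

    # The Euclidean distance is ~635, almost entirely determined by SAT
    print(np.sqrt(squared_diff.sum()))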

Here's the scatter plot we get after scaling the dataset and performing K-means:

Student clusters after applying data scaling
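In case you want to try this yourself, here's a minimal sketch of the scale-then-cluster workflow in Python with scikit-learn. The file name and column names are hypothetical, and the choice of scaler and cluster count is up to you:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical CSV with "SAT" and "GPA" columns
    scores = pd.read_csv("student_scores.csv")

    # Scale both features so neither dominates the distance calculations
    scaled = StandardScaler().fit_transform(scores[["SAT", "GPA"]])

    # Cluster the scaled data into 4 groups
    kmeans = KMeans(n_clusters=4, random_state=0)
    scores["cluster"] = kmeans.fit_predict(scaled)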

Data Scaling Techniques

There are two main methods for scaling data: normalization and standardization.

Normalization

Normalization uses the minimum and maximum to transform features onto the same scale. Below is the formula used to normalize each feature:
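    x_scaled = (x - x_min) / (x_max - x_min)

Here, x_min and x_max are the minimum and maximum values of the feature.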

Here's what the student scores dataset would look like when each feature is normalized:

Student    SAT         GPA
1          0.849167    0.3375
2          0.820000    0.2900
3          0.320000    0.8950
4          0.290000    0.8275
5          0.603333    0.2450
...        ...         ...
4996       0.235833    0.5050
4997       0.282500    0.8950
4998       0.284167    0.8025
4999       0.573333    0.2525
5000       0.162500    0.4000
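As a rough sketch, here's that transformation applied to a handful of rows with NumPy. The numbers won't exactly match the table above, since the minimum and maximum here come from just these five rows rather than all 5000 students:

    import numpy as np

    # A few (SAT, GPA) rows from the original table
    scores = np.array([[1419, 1.35],
                       [1384, 1.16],
                       [784, 3.58],
                       [748, 3.31],
                       [1124, 0.98]])

    # Apply the min-max formula to each column
    x_min = scores.min(axis=0)
    x_max = scores.max(axis=0)
    normalized = (scores - x_min) / (x_max - x_min)
    print(normalized)

    # scikit-learn's MinMaxScaler does the same thing:
    # from sklearn.preprocessing import MinMaxScaler
    # normalized = MinMaxScaler().fit_transform(scores)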

Normalization rescales the dataset's features to a fixed range, either 0 to 1 or -1 to 1. It is typically used when the numerical distribution of the features is not known. Normalization fails, however, on datasets with outliers. Allow me to explain why.

Suppose you have a small dataset containing the following points.

X       Y
0       1
2       3
5       3
10      13
15      17
20      30
22      23
24      19
990     25
1000    27

As you can see from the table, the last two data points are outliers: their X values are far larger than the rest. After normalization, the data points transform into the following:

X        Y
0.000    0.000
0.002    0.083
0.005    0.083
0.010    0.500
0.015    0.666
0.020    0.458
0.022    0.583
0.024    0.708
0.990    0.833
1.000    1.000

Even though normalization squeezes the values between 0 and 1, the outliers still remain outliers. The two extreme X values force the rest of the X column into the narrow range 0.000 to 0.024, while the Y values still span the full 0 to 1 range. If we were to compare the non-outlier points using a distance function, the Y feature would therefore have far more influence on the measurement.

Standardization

Standardization uses the mean and standard deviation to transform a dataset. The formula for transforming each feature is shown below:
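    x_scaled = (x - µ) / σ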

Here, µ is the mean of the feature and σ is its standard deviation. This is what the same student score dataset from earlier would look like when standardized.

Student    SAT          GPA
1           1.393781    -0.447397
2           1.283359    -0.634040
3          -0.609587     1.743211
4          -0.723163     1.477980
5           0.463083    -0.810861
...         ...          ...
4996       -0.928232     0.210768
4997       -0.751557     1.743211
4998       -0.745248     1.379747
4999        0.349506    -0.781391
5000       -1.205864    -0.201813
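Here's the matching sketch for standardization, again on just a handful of rows, so the values won't line up exactly with the table above:

    import numpy as np

    # A few (SAT, GPA) rows from the original table
    scores = np.array([[1419, 1.35],
                       [1384, 1.16],
                       [784, 3.58],
                       [748, 3.31],
                       [1124, 0.98]])

    # Apply the z-score formula to each column
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0)
    standardized = (scores - mu) / sigma
    print(standardized)

    # scikit-learn's StandardScaler does the same thing:
    # from sklearn.preprocessing import StandardScaler
    # standardized = StandardScaler().fit_transform(scores)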

Standardization is especially helpful when the features in the dataset are normally distributed, but your data doesn't have to be normally distributed in order to use it. Unlike normalization, standardization does not squeeze your data into a bounded range, so it still works reasonably well when there are outliers present.

That's all, folks!

There are many other data scaling techniques, but normalization and standardization are the two most commonly used in practice. In my next post I will discuss fuzzy clustering.