PCA and the Nuts and Bolts of It

Clustering high-dimensional datasets can be problematic. In my last blog post, I mentioned a few methods that can be used to cluster data with many dimensions. Today, we will explore one of these methods in more detail. By the end of this post, you will have an intuitive understanding of Principal Component Analysis (or PCA) and know how to use it effectively.

What is PCA?

Principal component analysis is a technique that reduces the number of dimensions of a dataset. Given a dataset with numerous attributes, PCA produces another dataset with significantly fewer attributes. The remarkable part about this smaller dataset is that it retains most of the information contained in the original dataset. Clustering algorithms can then be run on this smaller dataset.

To give you a sense of how PCA is performed, consider this little toy scenario:

Let's say you are a young professional who lives in the San Francisco Bay Area. As a multi-hobbyist, you partake in over 20 recreational activities on a weekly basis. It just so happens that you, like many Bay Area residents, also work at a technology startup -- and a rather demanding one at that. For the sake of both your health and your career, you decide to cut back on the number of activities you take part in. Deciding which activities to drop, however, is a very difficult task for you -- you enjoy them all very much. Then you find an app that can help you pare your list down significantly. You punch all 20 of your activities into the app, and in a matter of seconds it outputs a list of five activities. These five activities provide the significant majority of all the enjoyment you experience on a weekly basis.

The five activities are known as principal components.

PCA: 1st Principal Component

Pictured above is a green line superimposed on a scatter plot of points. This green line is the first principal component. It is chosen so that the data points are spread out along it as much as possible -- in other words, it is the direction of maximum variance.

PCA: 1st and 2nd Principal Components

Now pictured is the same green line with the addition of a line along another direction. This additional line is the second principal component, chosen so that it maximizes the spread of the data in a direction perpendicular to the first.

The goal of principal component analysis is to produce one or more of these lines. Each line represents the maximum spread of the data in one direction (i.e. dimension). After we find all of these lines, we keep the few that cover the majority of the variance in the data and throw out the rest. The selected components are then used to transform our dataset into one with fewer dimensions.

Now for some math

Now that you understand principal component analysis at a high level, let me show you what it looks like mathematically.

A dataset with n observations and p attributes is represented mathematically as a matrix with n rows and p columns.

PCA begins with a simple rescaling of all the attributes in the dataset. Since the goal is to select directions that explain the most variance in the dataset, rescaling ensures that this process isn't skewed by attributes measured on very different scales. You can check out this post [link to post] for more information about data scaling.
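
As a concrete illustration, here is a minimal rescaling sketch using scikit-learn's StandardScaler. The toy array X below is made up purely to have something to scale; any n-by-p NumPy array would work the same way.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 100 observations of 3 attributes measured on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

# Standardize each attribute to zero mean and unit variance so that no single
# attribute dominates the variance calculations that follow.
X_scaled = StandardScaler().fit_transform(X)
```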

Using the rescaled matrix, a covariance matrix is computed. A covariance matrix is a square matrix that describes the variance of each attribute in the dataset and the covariance between every pair of attributes. The covariance matrix is computed as follows:
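
$$C \;=\; \frac{1}{n-1}\, X^{\top} X$$

Here X is the rescaled data matrix with n rows and p columns; after standardization, each of its columns has zero mean, so this is the standard sample covariance formula.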

The diagonal elements of the matrix, from top left to bottom right, each represent the variance of a single attribute. The off-diagonal elements represent the covariance between two attributes. Both are computed as follows:
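
$$\mathrm{var}(x_i) \;=\; \frac{1}{n-1}\sum_{k=1}^{n}\left(x_{ki} - \bar{x}_i\right)^2$$

$$\mathrm{cov}(x_i, x_j) \;=\; \frac{1}{n-1}\sum_{k=1}^{n}\left(x_{ki} - \bar{x}_i\right)\left(x_{kj} - \bar{x}_j\right)$$

Here $x_{ki}$ is the value of attribute $i$ for observation $k$, and $\bar{x}_i$ is the mean of attribute $i$; these are the standard sample variance and covariance formulas.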

The covariance matrix describes the relationships between the attributes in the dataset. Using the covariance matrix, we can determine how two attributes vary in relation to one another and identify attributes that are highly correlated.
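
If you want to compute the covariance matrix yourself, NumPy's np.cov does it in one call. The random X_scaled below is just a stand-in for a rescaled n-by-p dataset.

```python
import numpy as np

# Stand-in for a rescaled dataset with n = 100 observations and p = 4 attributes.
rng = np.random.default_rng(0)
X_scaled = rng.standard_normal((100, 4))

# rowvar=False tells NumPy that the columns (not the rows) are the attributes.
cov_matrix = np.cov(X_scaled, rowvar=False)
print(cov_matrix.shape)  # (4, 4)
```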

With the covariance matrix in hand, the next step is to decompose the matrix into a set of eigenvalues and eigenvectors. An eigenvector is a vector that retains its direction when it is transformed by a matrix.

In the image above, notice that every line except the green one changes direction after the transformation is applied. The green line is an eigenvector: it merely stretches in size as a result of the transformation. The amount by which the green line stretches is called an eigenvalue.

This stretching is represented in linear algebra with the following equation:
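
$$M\,v \;=\; \lambda\,v$$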

M in the formula above represents a square matrix, v represents a vector, and λ (lambda) represents the eigenvalue. Multiplying the matrix by the vector gives the same result as scaling the vector by the eigenvalue.

Another way to think of an eigenvector is as the north-south axis of a globe. As you rotate the globe about this axis, every point on the globe changes direction except the points along the axis.

Finally, we take the p eigenvalues and corresponding eigenvectors that were computed and order the eigenvectors by decreasing eigenvalue. Eigenvalues represent the variance of the dataset along a particular dimension. Therefore, we are essentially ordering the results by decreasing variance.
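
Here is a minimal sketch of this step in NumPy; the 3-by-3 covariance matrix below is made up purely for illustration.

```python
import numpy as np

# An illustrative (symmetric) covariance matrix.
cov_matrix = np.array([[1.00, 0.80, 0.30],
                       [0.80, 1.00, 0.25],
                       [0.30, 0.25, 1.00]])

# eigh is the right choice for symmetric matrices such as a covariance matrix;
# it returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder so that the eigenvector with the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal components
```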

The first eigenvector is known as the first principal component. Of all the eigenvectors computed, it is the one that extracts the largest amount of variance present in the dataset.

The second eigenvector is known as the second principal component and extracts the second largest amount of variance.

The third eigenvector is known as the third principal component and extracts the third largest amount of variance.

And so on…

From this point, we decide on the number of principal components we want to keep. Let's call this number k. The selected components are then used to create a p-by-k matrix of principal components. Let's call this matrix P. P is used to transform our dataset into a lower-dimensional n-by-k matrix. This is done by multiplying our dataset by P.
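
In matrix form, with X the rescaled dataset:

$$X_{\text{reduced}} \;=\; X\,P$$

The dimensions work out as (n × p)(p × k) = (n × k), which is exactly the lower-dimensional dataset we were after.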

This entire process is summed up in the following steps (a short code sketch of the full pipeline follows the list):

  1. Rescale the dataset (i.e. normalize or standardize).
  2. Compute the covariance matrix of the dataset.
  3. Find the eigenvalues and eigenvectors of the covariance matrix.
  4. Order the eigenvectors by decreasing eigenvalue.
  5. Choose k eigenvectors to keep.
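
Putting the five steps together, here is a minimal from-scratch sketch in NumPy. The synthetic data and the choice of k = 2 are just for illustration; in practice you would typically reach for a library implementation such as sklearn.decomposition.PCA.

```python
import numpy as np

# Toy dataset: n = 200 observations of p = 5 attributes.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# 1. Rescale: standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix of the rescaled data (p x p).
cov_matrix = np.cov(X_scaled, rowvar=False)

# 3. Find the eigenvalues and eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 4. Order the eigenvectors by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep k eigenvectors and project the data onto them.
k = 2
P = eigenvectors[:, :k]       # p x k matrix of principal components
X_reduced = X_scaled @ P      # n x k lower-dimensional dataset
print(X_reduced.shape)        # (200, 2)
```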

How many principal components do I keep?

Now that you know how to compute principal components, you're probably wondering how to determine the optimal number of principal components to keep for dimensionality reduction. You can determine this number using one of two methods.

Method #1: Selection by Variance Explained

In the first approach, you retain only the principal components that explain a desired amount of variance in the data.

  1. Decide on the percentage of total variance you want to retain.
  2. Sort the eigenvalues in decreasing order.
  3. Divide each eigenvalue by the sum of all eigenvalues to get the percentage of variance it explains.
  4. Add up these percentages, in order, until you reach the desired amount of variance; the number of eigenvalues used is the number of components to keep.

Here's an example…

Suppose after performing PCA on a dataset of 30 features, you obtained the following eigenvalues:

{5.2090, 99.0224, 33.1024, 17.7019, 54.8435, 91.7130, 5.6278, 10.5985, 101.1147, 109.3971, 78.7155, 49.7158, 39.2546, 57.7061, 44.6071, 39.3120, 26.8996, 1.7817, 57.3353, 85.5184, 112.0761, 27.1018, 102.6468, 74.4120, 13.5147, 25.4182, 23.9367, 6.9049, 76.5605, 63.62}

If you wanted to keep the principal components that explain 80% of the variance in the dataset, here's what the process would look like.

1. Sort the eigenvalues in descending order:

{112.0761, 109.3971, 102.6468, 101.1147, 99.0224, 91.713, 85.5184, 78.7155, 76.5605, 74.412, 63.62, 57.7061, 57.3353, 54.8435, 49.7158, 44.6071, 39.312, 39.2546, 33.1024, 27.1018, 26.8996, 25.4182, 23.9367, 17.7019, 13.5147, 10.5985, 6.9049, 5.6278, 5.209, 1.7817}

2. Compute the percentage of variance explained by each eigenvalue:

{0.072996241, 0.071251383, 0.066854847, 0.065856976, 0.064494241, 0.059733558, 0.055698956, 0.051268162, 0.049864589, 0.048465251, 0.041436317, 0.037584538, 0.037343032, 0.035720099, 0.032380378, 0.029053033, 0.025604283, 0.025566898, 0.021559911, 0.017651663, 0.017519968, 0.016555118, 0.015590203, 0.011529418, 0.008802254, 0.006902905, 0.004497228, 0.00366544, 0.003392672, 0.001160438}

3. Add the percentages of variance explained until you reach the desired amount of variance.

A running total of the above numbers would produce the following:

{0.072996241, 0.144247624, 0.211102471, 0.276959447, 0.341453688, 0.401187246, 0.456886202, 0.508154364, 0.558018953, 0.606484204, 0.647920521, 0.685505059, 0.722848091, 0.75856819, 0.790948568, 0.820001601, 0.845605884, 0.871172783, 0.892732694, 0.910384357, 0.927904325, 0.944459443, 0.960049645, 0.971579063, 0.980381317, 0.987284222, 0.99178145, 0.99544689, 0.998839562, 1}

From the result above, you can see that 80% of the variance is explained by the first 16 components (the running total first exceeds 0.80 at the 16th value). Those are therefore the principal components we will keep for dimensionality reduction.
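
As a quick sanity check, here is the same calculation as a short NumPy sketch, using the example eigenvalues above; np.searchsorted finds the first index at which the running total reaches 80%.

```python
import numpy as np

eigenvalues = np.array([
    5.2090, 99.0224, 33.1024, 17.7019, 54.8435, 91.7130, 5.6278, 10.5985,
    101.1147, 109.3971, 78.7155, 49.7158, 39.2546, 57.7061, 44.6071, 39.3120,
    26.8996, 1.7817, 57.3353, 85.5184, 112.0761, 27.1018, 102.6468, 74.4120,
    13.5147, 25.4182, 23.9367, 6.9049, 76.5605, 63.62,
])

sorted_vals = np.sort(eigenvalues)[::-1]         # 1. sort in descending order
explained = sorted_vals / sorted_vals.sum()      # 2. fraction of variance explained
cumulative = np.cumsum(explained)                # 3. running total
k = int(np.searchsorted(cumulative, 0.80)) + 1   # first k reaching 80%
print(k)  # 16
```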

Method #2: Use a scree plot

A scree plot displays the eigenvalues, in descending order, against the component number. The optimal number of components corresponds to the "elbow" of the plot: the point after which the eigenvalues flatten out and additional components add little variance.
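
Here is a minimal matplotlib sketch of a scree plot. The decaying eigenvalues are generated purely for illustration; in practice you would plot your own sorted eigenvalues.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative eigenvalues, sorted in descending order.
rng = np.random.default_rng(1)
eigenvalues = np.sort(rng.exponential(scale=30.0, size=30))[::-1]

components = np.arange(1, len(eigenvalues) + 1)
plt.plot(components, eigenvalues, marker="o")
plt.xlabel("Principal component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```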

When not to use PCA

Principal component analysis projects a dataset into a lower-dimensional subspace. In the process, the original features are combined into new features that no longer have a direct, human-interpretable meaning. PCA should therefore not be used in situations where interpretability of the features being analyzed is necessary.

That's all, folks!

In my next post, I will discuss another dimensionality reduction technique called singular value decomposition.

P.S. In case you'd like to explore eigenvalues and eigenvectors more deeply, I found an excellent resource that provides an even more intuitive explanation of the concept. Check it out here.