Lecture Notes



Steven Holland

PCA: Matrix Operations

R's prcomp() function, and other like it, are front ends that perform the matrix operations in a principal components analysis. These matrix operations are not complex, and it is straightforward to compare their results to those of prcomp().

First, we will create a two-dimensional data set that we will apply PCA. This could be done for data sets with more variables, but using just two makes it easier to inspect the results.

rawX <- rnorm(50, sd=3)
errors <- rnorm(50, sd=3)
slope <- 0.8
rawY <- slope * rawX + errors
raw <- cbind(rawX, rawY)
remove(rawX, rawY, slope, errors)

Next, we will run prcomp() and extract the loadings, singular values, and scores. The center argument moves the centroid of the data to the origin, and retx argument ensures that the scores are returned. Remember that the loadings are the eigenvector and singular values are the standard deviation along each principal component.

pca <- prcomp(raw, center=TRUE, retx=TRUE)
loadings <- pca$rotation
singular <- pca$sdev
scores <- pca$x

Next, we will calculate all of this using the underlying matrix operations. To do this, we need to center the data, that is put the centroid of the cloud of data at the origin, and this is done by subtracting the column mean from every value in each column in the raw data. From this, we calculate a covariance matrix, and from that we calculate the eigenvalues and eigenvectors using singular value decomposition.

centered <- apply(raw, MARGIN=2, FUN=function(x){x-mean(x)})
covMatrix <- cov(centered)
eigen <- eigen(covMatrix)

From this, we can calculate the singular values, the loadings, and the scores. Because the eigenvalues are the variance along each principal component, the singular values are the square root of the eigenvalues. Thus, the singular values are the standard deviation along each principal component.

singularLong <- sqrt(eigen$values)

The loadings are the eigenvectors.

loadingsLong <- eigen$vectors
rownames(loadingsLong) <- colnames(raw)

The scores are the centered values of the raw data, post-multiplied by the loadings.

scoresLong <- centered %*% loadingsLong

Comparison of results

The singular values for both approaches are identical.

> singular
[1] 4.714315 2.268171
> singularLong
[1] 4.714315 2.268171

The absolute values of the loadings are also identical. Note that the orientation of the PCs is arbitrary, so that the signs of the loadings for each PC might be reversed between that produced by prcomp() and that produced by the matrix operations, as they are here.

> loadings
            PC1        PC2
rawX -0.5369201  0.8436331
rawY -0.8436331 -0.5369201
> loadingsLong
          [,1]        [,2]
rawX  0.5369201 -0.8436331
rawY  0.8436331  0.5369201

Last, the absolute values of the scores are also identical. If the signs for the loadings are reversed, the signs for the scores will also be reversed, as they are here. Remember that in any PCA, you can reverse the signs for the loadings and the scores for any of the PCs, such as if you are comparing PCAs of different data sets and you want the patterns to go in the same direction. Just be sure that if you change the signs, you change the signs for all loadings on a given PC and for all the scores on that same PC.

> head(scores)
            PC1        PC2
[1,] -3.4573188  1.9034691
[2,] -3.7136078  0.2395753
[3,]  6.7840044 -0.8680275
[4,] -7.0073627 -2.9162746
[5,]  0.2528793  0.8882884
[6,]  1.5870900 -1.6455650
> head(scoresLong)
           [,1]       [,2]
[1,]  3.4573188 -1.9034691
[2,]  3.7136078 -0.2395753
[3,] -6.7840044  0.8680275
[4,]  7.0073627  2.9162746
[5,] -0.2528793 -0.8882884
[6,] -1.5870900  1.6455650