Steven Holland

# Principal Components Analysis

## Introduction

Suppose we had measured two variables, length and width, and plotted them as shown below. Both variables have approximately the same variance and they are highly correlated with one another. We could pass a vector through the long axis of the cloud of points and a second vector at right angles to the first, with both vectors passing through the centroid of the data.

Once we have made these vectors, we could find the coordinates of all of the data points relative to these two perpendicular vectors and replot the data, as shown here (both of these figures are from Swan and Sandilands, 1995).

In this new reference frame, note that variance is greater along axis 1 than it is on axis 2. Also note that the spatial relationships of the points are unchanged; this process has merely rotated the data. Finally, note that our new vectors, or axes, are uncorrelated. Mathematically, the orientations of these axes relative to the original variables are called the eigenvectors, and the the variances along these axes are called the eigenvalues.

By performing such a rotation, the new axes might have particular explanations. In this case, axis 1 could be regarded as a size measure, with samples on the left having both small length and width and samples on the right having large length and width. Axis 2 could be regarded as a measure of shape, with samples at any axis 1 position (that is, of a given size) having different length to width ratios. These axes contain information from our original variables, but they do not coincide exactly with any of the original variables.

In this simple example, these relationships may seem obvious. When dealing with many variables, this process allows you to assess any relationships among variables very quickly. For data sets with many variables, the variance of some axes may be great, whereas the variance on others may be so small that they can be ignored. This is known as reducing the dimensionality of a data set. For example, you might start with thirty original variables, but might end with only two or three meaningful axes. The formal name for this approach of rotating data such that each successive axis displays a decreasing amount of variance is known as Principal Components Analysis, or PCA. PCA produces linear combinations of the original variables to generate the axes, also known as principal components, or PCs.

## Computation

The data set should be in standard matrix form, with n rows of samples and p columns of variables. There should be no missing values: every variable should have a value for every sample, and this value may be zero. The data are first centered on the means of each variable. This insures that the cloud of data is centered on the origin of our principal components, but it does not affect the spatial relationships of the data nor the variances along our variables. The first principal component (Y1) is given by a linear combination of the variables X1, X2, ..., Xp

or, in matrix notation

The first principal component is calculated such that it accounts for the greatest possible variance in the data set. Of course, one could make the variance of Y1 as large as possible by choosing large values for the weights a11, a12, ... a1p. To prevent this, the sum of squares of the weights is constrained to be 1.

The second principal component is calculated in the same way, with the conditions that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.

This continues until a total of p principal components have been calculated, equal to the original number of variables. At this point, the total variance on all of the principal components will equal the total variance among all of the variables. In short, all of the original information has been explained or accounted for. In matrix notation, the transformation of the original variables to the principal components is written as

Calculating these transformations or weights requires a computer for all all but the smallest matrices. Specifically, these are calculating through singular value decomposition. The rows of matrix A are called the eigenvectors, and these specify the orientation of the principal components relative to the original variables. The elements of an eigenvector, that is, the values within a particular row of matrix A, are the weights aij, and are also known as loadings. Large loadings indicate that a particular variable has a strong relationship to a particular principal component.

From matrix A and matrix Sx, the variance-covariance matrix of the original data, one can calculate the variance-covariance matrix of the principal components:

The elements in the diagonal of matrix Sy are the eigenvalues, that is, the variance explained by each principal component. These are constrained to decrease monotonically from the first principal component to the last. Eigenvalues are commonly plotted on a scree plot to show the decreasing rate at which variance is explained by additional principal components. The off-diagonal elements of matrix Sy are zero, because the principal components are constructed such that they are independent and therefore have zero covariance. The square roots of the eigenvalues (which are therefore the standard deviations of the principal components) are called the singular values (as in singular value decomposition).

The positions of each observation in this new coordinate system of principal components are called scores. These are calculated as linear combinations of the original variables and the weights aij. For example, the score for the rth sample on the kth principal component is calculated as

In interpreting the principal components, it is often useful to know the correlations of the original variables with the principal components. The correlation of variable Xi and principal component Yj is

Because reduction of dimensionality is a goal of principal components analysis, several criteria have been proposed for determining how many PCs should be examined and how many should be ignored. Four criteria are common, and all are best done by examining the scree plot:

1. Ignore principal components at the point at which the next PC offers little increase in the total explained variation.
2. Include all PCs up to a predetermined total percent explained variation, such as 90%.
3. Ignore components whose explained variation is less than 1 when a correlation matrix is used, or less than the average variation explained when a covariance matrix is used, with the idea being that such a PC offers less than one variable’s worth of information.
4. Ignore the last PCs whose explained variation are all roughly equal.

Principal component analysis is equivalent to major axis regression; it is the application of major axis regression to multivariate data. As such, principal components analysis is subject to the same restrictions as regression, in particular multivariate normality, which can be evaluated with the MVN package. The distributions of each variable should be checked for normality and transforms used where necessary to correct high degrees of skewness in particular. Outliers should be removed from the data set as they can dominate the results of a principal components analysis.

## PCA in R

1) Import the data, cull samples or variables as needed to remove missing values, and perform any data transformations that are necessary to remove outliers and achieve something approaching multivariate normality. Here, we will use the purdin geochemistry data set, a set of geochemical measurements on a series of Upper Ordovician limestone beds from central Tennessee. The first column contains the stratigraphic position, which we’ll use as an external variable to interpret our ordination. The remaining columns contain geochemical data, which is what we want to ordinate, and we’ll pull those off as a separate object. The first two columns of the geochemical data are stable isotopic ratios that need no transformation, but the remaining columns are weight percentages of major oxides, which should be log transformed. Finally, the log transformation of the potassium column contains many -INF results, caused by potassium values of zero. Here, we’ll choose to delete the potassium column; another option would be to keep it and use a log(x+1) transformation instead.

colnames(purdin)
geochem <- purdin[ , 2:10]
geochem[ , 3:9] <- log10(geochem[ , 3:9])
geochem <- geochem[, -6]

2) Run the PCA with prcomp(). We want all variables to have an equal effect on the ordination, rather than having numerically large variables dominate the results, so we set the center argument to TRUE, which sets the means of all variables equal to zero, and the scale. argument to TRUE, which sets the standard deviations of all variables to be equal to 1. We’ll assign the ordination result to a new object (of type prcomp), and then we’ll pull off several parts of this object (the standard deviation of each principal component, the loadings for each variable, and the scores for each sample) as their own objects, which will simplify our code later. This also lets us call them by their common names (loadings, scores), rather than by the more obscure names that R uses (rotation, x).

geochem.pca <- prcomp(geochem, retx=TRUE, center=TRUE, scale.=TRUE)
sd <- geochem.pca\$sdev
scores <- geochem.pca\$x

3) To interpret the results, the first step is to determine how many principal components to examine, at least initially. Remember, although PCA will return as many principal components as we have variables (eight, here), the point of PCA is to reduce dimensionality, so we will concentrate our initial interpretations on the largest principal components. To do this, we calculate the percent of total variance explained by each principal component, and make a bar plot of that. To this plot, we will add a line that indicates the amount of variance each variable would contribute if all contributed the same amount (that is, equivalent to criteria #3 above).

var <- sd^2
var.percent <- var/sum(var) * 100
dev.new()
barplot(var.percent, xlab='PC', ylab='Percent Variance', names.arg=1:length(var.percent), las=1, ylim=c(0, max(var.percent)), col='gray')
abline(h=1/ncol(geochem)*100, col='red')

The first thing we should look for in the scree plot is a marked drop in the percentage of explained variation, and ignore those principal components that explain relatively little variation. Here, principal components 1 and 2 explain a large proportion of the variance, with subsequent principal components explaining much less. This would suggest we should focus primarily on principal components 1 and 2.

A second criteria we should consider is which principal components explain more than one variable’s worth of information. Since we have eight variables, if each variable contributed equally, they would each contribute 12.5% to the total variance, indicated by the red line. This criterion suggests we should also include principal component 3 (barely) in our interpretation. When we report our results, we should state the explained variance of each principal component we used, plus their combined explained variance, so we’ll calculate those exactly:

var.percent[1:3]
sum(var.percent[1:3])

In this case, the first three principal components accounts for just over 75% of the explained variance.

4) The next step in our interpretation is to understand how our variables contribute to each of the principal components, and this is revealed by the loadings. Positive loadings indicate a variable and a principal component are positively correlated: an increase in one results in an increase in the other. Negative loadings indicate a negative correlation. Large (either positive or negative) loadings indicate that a variable has a strong effect on that principal component. We need to scan the table to find those large loadings, so we first need a criterion of what constitutes a “large” loading. Because the sum of the squares of all loadings for an individual principal component must sum to one, we can calculate what the loadings would be if all variables contributed equally to that principal component. Any variable that has a larger loading than this value contributes more than one variable’s worth of information and would be regarded as an important contributor to that principal component.

A table of loadings should always be presented for the principal components that are used. It is usually helpful to rearrange the rows so that all the variables that contribute strongly to principal component 1 are listed first, followed by those that contribute strongly to principal component 2, and so on. Large loadings can be highlighted in boldface to emphasize the variables that contribute to each principal component.

In this example, the loadings are readily interpretable. Axis 1 has strong positive loadings for calcium, and strong negative loadings for silicon, iron, and aluminum. As these are analyses of limestones, this likely reflects relatively pure limestone at positive axis 1 scores and increasing clay content at negative axis 1 scores. Axis 2 has strong positive loadings for d18O and magnesium, which may reflect dolomitization. The reason for the strong negative loading of manganese, although someone familiar with limestone diagenesis may have an idea of what it could reflect. Axis 3 is dominated by large positive loadings for d13C, which may reflect oceanographic processes controlling the carbon isotopic composition of these limestones.

5) The last step is to visualize the distribution of the samples in this new ordination space, which is just a rotation of our original variable space. The scores describe the location of our samples, and we’ll plot these as scatterplots, starting with the most important pair of principal components (1 and 2). This is called a distance biplot, and it shows individual samples as well as vectors that illustrate how an increase in a given variable “pushes” a sample in this space. R comes with a built-in function for making a distance biplot, called biplot.prcomp(). If we supply it a prcomp object, we can call it with biplot().

dev.new(height=7, width=7)

We can also generate this plot ourselves, which gives us greater control in what we show. In this example, we have a scaling variable, which controls the length of the vectors, and a textNudge variable, which controls how far beyond the vector tips the labels appear. If you use this code for your own PCA, you will need to adjust these values to get the vectors as long as possible while keeping their labels on the plot.

dev.new(height=7, width=7)
plot(scores[, 1], scores[, 2], xlab='PCA 1', ylab='PCA 2', type='n', asp=1, las=1)
scaling <- 5
textNudge <- 1.2
text(scores[, 1], scores[, 2], rownames(scores), col='blue', cex=0.7)

One advantage of this approach to making a biplot is that it allows us to color-code samples by external variables to see if samples sort according to those variables. Here, stratigraphic position was our external variable, and a major boundary occurs at 34.2, with samples below belonging to the Carters Formation and samples above belonging to the Hermitage Formation. We can code vectors for these and use these to plot their samples.

dev.new(height=7, width=7)
Carters <- purdin\$StratPosition < 34.2
Hermitage <- purdin\$StratPosition > 34.2
plot(scores[, 1], scores[, 2], xlab='PCA 1', ylab='PCA 2', type='n', asp=1, las=1)
points(scores[Carters, 1], scores[Carters, 2], pch=16, cex=0.7, col='blue')
points(scores[Hermitage, 1], scores[Hermitage, 2], pch=16, cex=0.7, col='red')
text(1, 3, 'Carters', col='blue')
text(-1, -4, 'Hermitage', col='red')

This shows the separation of sample in the two formations beautifully, with near-perfect separation. Interestingly, two Carters samples plot within the Hermitage field, and vice versa, and these suggest we should examine our samples again. In this case, the two Carters samples within the Hermitage field turned out to be intraclasts: pieces of eroded Carters rock that was deposited in the base of the Hermitage.

We can also combine this plot with a knowledge of our variables derived from the loadings. Both stratigraphic units show variation along principal component 1, which is controlled by lithology: carbonate-rich rocks on the right (high in Ca) and clay-rich rocks on the left (high in Si, Al, and Fe). The two formations separate along principal component 2, with Carters Formation samples having high d18O and Mg values, possibly indicating dolomitization, and Hermitage Formation samples having high d13C and Mn values, tied to a change in seawater chemistry. These annotations can be added to the plot to make the axes easily interpretable to a reader.

text(-1, -4.9, 'non-dolomitized', pos=3, col='gray')
text(-1, 3, 'dolomitized', pos=3, col='gray')
text(2.6, -1.8, 'clean limestone', pos=3, col='gray')
text(-5, -1.8, 'clay-rich limestone', pos=3, col='gray')

If you have multiple external variables, you can color-code each of those variables on a different plot to understand how those variables are arrayed in ordination space. You can also make similar biplots for the higher principal components (axes 3 and up).

## What to report

When reporting a principal components analysis, always include at least these items:

• A description of any data culling or data transformations that were used prior to ordination. State these in the order that they were performed.
• Whether the PCA was based on a variance-covariance matrix (i.e., scale.=FALSE was used) or a correlation matrix (i.e., scale.=TRUE was used).
• A scree plot that shows the explained variance of each of the principal components and that illustrates the criteria used for selecting the number of principal components to be studied.
• A table of loadings that shows the loadings of all of the variables for each of the principal components that was studied. The table should highlight (e.g., with boldface) those loadings that are considered the most important for each principal component.
• One or more plots of sample scores that emphasizes the interpretation of the principal components, such as color-coding samples by an external variable. It may be useful to show the vectors corresponding to the loadings on these plots.
• PCA results can be complex, and it is important to remember the GEE principle in your writing, which stands for Generalize, Examples, Exceptions. Begin each paragraph with a generalization about the main pattern shown by the data. Next, provide examples that illustrate that pattern. Finally, cover the exceptions to the pattern last. For a PCA, you might begin with a paragraph on variance explained and the scree plot, followed by a paragraph on the loadings for PC1, then a paragraph for loadings on PC2, etc. These would then be followed by paragraphs on sample scores for each of the PCs, with one paragraph for each PC. If you do this in all your writing about data, not just PCA, it will be much easier for your readers to follow.

## Applications

PCA is useful in any situation where you have many variables that are correlated to some degree, and you would like to reduce the dimensionality, that is be able to use fewer variables.

One common situation is in any multivariate data set, where reducing the dimensionality helps to identify patterns among the variables and to identify commonalities among the rows (or objects).

A second situation where this occurs is when you have many predictor variables in a regression and those variables show multicollinearity. One option is to remove variables, but if you have many variables, you may still be left with too many variables to make regression straightforward, given the multiple paths that are possible in backward elimination. A better option is to run a PCA on the predictor variables and use the first few PCs as the predictor variables in the regression. This removes the multicollinearity and makes the creation of a regression model simpler.

A third situation where this arises is when you have many time series from many geographic location, as is common in meteorology, oceanography, and geophysics, especially for the output of numerical models. PCA of this data can reduce the dimensionality of this data, making it much simpler to identify the important spatial and temporal patterns. This type of PCA is called an Empirical Orthogonal Function or EOF.

### Matrix operations

For the curious, it is straightforward to use matrix operations to perform a principal components analysis. Here's how.

## References

Legendre, P., and L. Legendre, 1998. Numerical Ecology. Elsevier: Amsterdam, 853 p.

Swan, A.R.H., and M. Sandilands, 1995. Introduction to Geological Data Analysis. Blackwell Science: Oxford, 446 p.