Lecture Notes

Steven Holland

Covariance and Correlation

A common goal is to describe the strength of the linear relationship of two variables. Covariance and correlation are the two most common measures for doing this.

Covariance

Covariance is based on the sum of products, analogous to the sum of squares used to calculate variance:

$$\mathrm{SP}_{jk} = \sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

where i is the observation number, n is the total number of observations, j is one variable, and k is the other variable. This is just the deviation from the mean of one variable multiplied by the deviation from the mean of the other variable, summed over all observations.

The sum of products can also be calculated in this way:

$$\mathrm{SP}_{jk} = \sum_{i=1}^{n} x_{ij}\,x_{ik} \;-\; \frac{\left(\sum_{i=1}^{n} x_{ij}\right)\left(\sum_{i=1}^{n} x_{ik}\right)}{n}$$

Here, the first term is called the uncorrected sum of products and the second term is called the correction term. It is corrected in the sense that it subtracts the mean from each of the observations, that is, it is the sum of products relative to the means of each variable, just as we measured the sum of squares relative to the mean.
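
Both formulations are easy to verify directly in R. The following sketch uses two small made-up vectors; x and y here are arbitrary example data, not from any particular data set:

# example data: any two numeric vectors of equal length will work
x <- c(2.3, 4.1, 3.7, 5.0, 4.4)
y <- c(1.1, 2.0, 1.8, 2.9, 2.4)
n <- length(x)

# sum of products from deviations from the means
SP1 <- sum((x - mean(x)) * (y - mean(y)))

# uncorrected sum of products minus the correction term
SP2 <- sum(x * y) - sum(x) * sum(y) / n

SP1
SP2   # matches SP1, apart from rounding error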

From the sum of products, we get covariance, which is analogous to variance. Note that the order of j and k has no effect on the result.

$$s_{jk} = \frac{\mathrm{SP}_{jk}}{n - 1} = \frac{\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)}{n - 1}$$

Unlike the sum of squares and variance, the sum of products and covariance can be positive or negative. Covariance has units equal to the product of the units of the two measurements, analogous to the squared units of variance. Also like variance, the value of covariance depends on the scale on which the data were measured, which is generally undesirable. Just as we convert standard deviation into a dimensionless description of variation by dividing it by the mean to obtain the coefficient of variation, we convert covariance into a dimensionless value by calculating Pearson’s correlation coefficient.
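
Continuing the sketch above, dividing the sum of products by n - 1 gives the covariance, which matches R’s built-in cov():

# covariance is the sum of products divided by n - 1
SP1 / (n - 1)
cov(x, y)   # same value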

Pearson’s correlation coefficient

Pearson’s correlation coefficient is also called the product-moment correlation coefficient. Often, it is just called correlation coefficient, and when someone doesn’t specify which type of correlation coefficient, they usually mean Pearson’s.

To produce a dimensionless measure of correlation, we need to divide the covariance by something that has the same units, and that is the product of the standard deviations of the two variables. This standardizes the joint variation in the two variables by the product of the variation in each variable. The correlation coefficient therefore measures the strength of the relationship between two variables, but it does not depend on the units of those variables.

$$r_{jk} = \frac{s_{jk}}{s_j \, s_k}$$

Pearson’s correlation coefficient varies from -1 (perfect negative correlation) to +1 (perfect positive correlation), with a value of 0 indicating no linear relationship. If one variable does not vary, one of the standard deviations will be zero, causing the denominator to be zero and making the correlation coefficient undefined.

Pearson’s correlation coefficient can also be calculated as the sum of products divided by the square root of the product of the two sums of squares:

$$r_{jk} = \frac{\mathrm{SP}_{jk}}{\sqrt{\mathrm{SS}_j \, \mathrm{SS}_k}}$$

In R, covariance is calculated with cov(), and correlation is calculated with cor():

cov(x, y)
cor(x, y)
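
To verify the definitions, the correlation can also be built up by hand, either from the covariance and the two standard deviations or from the sums of products and squares. This sketch assumes x and y are numeric vectors, such as the example vectors defined earlier:

# correlation as covariance standardized by the two standard deviations
cov(x, y) / (sd(x) * sd(y))

# correlation as the sum of products divided by the square root of the
# product of the two sums of squares
SP <- sum((x - mean(x)) * (y - mean(y)))
SSx <- sum((x - mean(x))^2)
SSy <- sum((y - mean(y))^2)
SP / sqrt(SSx * SSy)

cor(x, y)   # both match the built-in function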

As always, we can make statistical inferences and test hypotheses about a statistic if we know its expected distribution. If the data are normally distributed, and if the null hypothesis is that there is zero correlation, a t-statistic based on Pearson’s correlation coefficient follows a t-distribution, where the standard error of Pearson’s correlation coefficient is:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

The t-test for Pearson’s r has the standard setup, that is, the statistic minus the null hypothesis for the parameter, divided by the standard error of the statistic:

$$t = \frac{r - \rho}{s_r}$$

Since the null hypothesis is that the correlation is zero, the ρ term drops out, and the t-statistic is simply Pearson’s correlation coefficient divided by its standard error.

All t-tests assume random sampling and that the statistic is normally distributed. In this case, Pearson’s r will be normally distributed only if the null hypothesis is that ρ equals zero, and if both variables are normally distributed.

Finally, the test assumes that the correlation coefficient is calculated on interval or ratio data. If you have ordinal data, you must use a non-parametric test.

The default test of a correlation coefficient is usually a two-tailed test (i.e., that it is not zero, that there is some correlation), but you can perform a one-tailed test if you have theoretical reasons for believing that the correlation ought to be either positive or negative. As always, this alternative hypothesis cannot be determined from your data; it must come from prior knowledge, such as a theoretical model or other data sets. In other words, you should not look at your data, see a positive correlation, and then test for a positive correlation.
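
In R, a one-tailed test is requested with the alternative argument of cor.test(). For example, if prior knowledge predicts a positive correlation:

# one-tailed test for a positive correlation; use 'less' for a negative one
cor.test(x, y, alternative='greater')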

The t-test of Pearson’s correlation coefficient has n-2 degrees of freedom, because population variance is estimated for two variables, not just one.

The t-test for Pearson’s correlation coefficient is most easily performed in R with cor.test(x, y). As expected, since it is just a t-test, the output is formatted much like that of t.test().

cor.test(x, y)

Although it is more complicated, you can perform a t-test on Pearson’s r manually, using the t-score and standard error formulas above. It will produce the same results as running cor.test(), so there is usually no reason to perform it manually.
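
For reference, a manual version might look like the following sketch, which assumes x and y are numeric vectors and reproduces the t-statistic and p-value reported by cor.test():

# manual t-test of Pearson's r against the null hypothesis that rho is zero
r <- cor(x, y)
n <- length(x)
se <- sqrt((1 - r^2) / (n - 2))   # standard error of r
tStat <- r / se                   # the rho term drops out of the numerator
pValue <- 2 * pt(-abs(tStat), df=n - 2)
tStat
pValue
cor.test(x, y)                    # same t-statistic and p-value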

The square of Pearson’s correlation coefficient, R², is a widely reported statistic with an extremely useful property: it measures the proportion of variation that is explained by a linear relationship between two variables. As a proportion, R² is always between zero and one.
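
In R, R² is simply the square of the correlation coefficient; it also equals the R-squared reported for a simple linear regression of one variable on the other, as this sketch shows:

# R-squared as the square of Pearson's r
cor(x, y)^2

# the same value is reported by a simple linear regression
summary(lm(y ~ x))$r.squared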

Spearman’s rank correlation

Pearson’s correlation coefficient measures how well the relationship between two variables can be described by a line, but two variables can be correlated without having a linear relationship. For example, two variables might show an exponential relationship, or a logistic one. Both would be described as monotonic relationships: as one variable increases, the other consistently increases or consistently decreases. Spearman’s rank correlation coefficient is a non-parametric measure of monotonic correlation.

The Spearman correlation test assumes only that your data are a random sample. Spearman’s coefficient is calculated on the ranks of the observations and will therefore work with any type of data that can be ranked, including ordinal, interval, or ratio data. Because it is based on ranks, it is less sensitive to outliers than the Pearson correlation test, and it is sometimes used to evaluate a correlation when outliers are present.

To understand how Spearman’s correlation works, imagine ranking all of your x observations, assigning 1 to the smallest value, 2 to the next smallest, and so on, and do the same for your y-variable. If the two variables have a perfect monotonic relationship, the ranks for the two variables would be either identical (a positive correlation) or reversed (a negative correlation). Spearman’s rank correlation coefficient measures how well these two sets of ranks agree.

In R, the Spearman rank correlation is performed by setting the method parameter of cor() and cor.test():

cor(x, y, method='spearman')
cor.test(x, y, method='spearman')
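
Because Spearman’s coefficient is Pearson’s coefficient applied to the ranks, it can also be calculated by ranking the data first. This sketch assumes x and y are numeric vectors:

# Spearman's coefficient equals Pearson's coefficient of the ranks
cor(rank(x), rank(y))
cor(x, y, method='spearman')   # same value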

Two caveats

Spurious correlation of time series

Random walks commonly show a strong and statistically significant (i.e., non-zero) correlation with one another, and the statistical significance increases as the length (n) of the time series increases. As a result, two unrelated time series are often strongly, and significantly, correlated with one another. This behavior often means that the two variables are each correlated with a third variable, such as time. When that is true, the data do not constitute a random sample because they are no longer an i.i.d. sequence: the observations are not independent.

For example, the number of ministers and the number of alcoholics were positively correlated through the 1900s. The number of each largely reflects the growing population during that time, so the two display a correlation even though they are not directly linked. There are plenty of other examples.

The spurious correlation of two random walks is easily simulated by taking the cumulative sum (or running sum) of two sets of random numbers:

x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor(x, y)

Because random walks are unrelated, you might expect that a test of correlation of random walks would produce a statistically significant result at a rate equal to α (e.g., 5%). Statistically significant correlations of random walks are actually far more common than that, meaning that the rate of making a type I error can be much larger than your chosen value of α would suggest. You can see this by running the following code several times; p-values less than 0.05 occur far more frequently than you would expect.

x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor.test(x, y)$p.value
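
The actual type I error rate can be estimated by simulation. This sketch assumes an alpha of 0.05 and time series of length 25; the exact rate will vary from run to run:

# proportion of pairs of independent random walks that appear
# significantly correlated at alpha = 0.05
p <- replicate(1000, cor.test(cumsum(rnorm(25)), cumsum(rnorm(25)))$p.value)
mean(p < 0.05)   # typically far greater than 0.05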

Spurious correlation is a problem not only for time series, but also for spatially correlated data.

You might think that you could address the problem of spurious correlation by collecting more data, but this makes the problem worse because increasing sample size decreases the p-value.
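
You can see this effect by repeating the simulation above with longer time series. In this sketch, rejectionRate() is a helper function written for illustration, not a standard R function:

# rejection rate for pairs of independent random walks of length n
rejectionRate <- function(n) {
  p <- replicate(1000, cor.test(cumsum(rnorm(n)), cumsum(rnorm(n)))$p.value)
  mean(p < 0.05)
}
rejectionRate(25)
rejectionRate(250)   # the rate of spurious correlations rises with n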

Spurious correlation can be avoided by differencing the data, that is, by calculating the change from one observation to the next, which shortens the data set by one observation. Differencing removes the dependence of each value on the preceding value, and with it the trends that cause spurious correlation, so differenced data will not display spurious correlation. In R, use the diff() function to calculate the differences quickly.

xdiff <- diff(x)
ydiff <- diff(y)
cor(xdiff, ydiff)
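
Differencing also restores the expected type I error rate, as this simulation sketch (again assuming an alpha of 0.05) suggests:

# after differencing, significant correlations occur at roughly the rate alpha
p <- replicate(1000,
  cor.test(diff(cumsum(rnorm(25))), diff(cumsum(rnorm(25))))$p.value)
mean(p < 0.05)   # close to 0.05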

Any time that you see someone trying to correlate time series or spatial series, always ask if the data were differenced. If they were not, be skeptical of any non-zero correlation.

Here is a simulation that illustrates the problem of spurious time-series correlation. Each plot shows the correlation coefficients from 10,000 random simulations of two sets of data. The upper plot shows the results for two uncorrelated variables, the middle plot shows two random walks (time series), and the bottom plot shows two differenced random walks. Note how commonly two random walks produce strong negative or positive correlations, and note how differencing returns the expectation to that of two uncorrelated variables. Follow the Ten Statistical Commandments of Chairman Alroy and Difference Thy Data.

par(mfrow=c(3,1))
 
trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(rnorm(sampleSize), rnorm(sampleSize)))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='Uncorrelated Variables', las=1)
 
r <- replicate(trials, cor(cumsum(rnorm(sampleSize)), cumsum(rnorm(sampleSize))))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='Random Walks', las=1)
# Note that the correlation coefficients are not clumped around zero as they would be for uncorrelated variables
 
r <- replicate(trials, cor(diff(cumsum(rnorm(sampleSize))), diff(cumsum(rnorm(sampleSize)))))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='First Differences of Random Walks', las=1)

Use and interpretation of Pearson and Spearman coefficients

The Pearson correlation coefficient measures the strength of a linear relationship, that is, how well the data are described by a line. In contrast, the Spearman correlation coefficient measures the strength of a monotonic relationship, that is, whether one variable consistently increases or decreases as the other increases, regardless of the amount of change. Relationships can be monotonic but not linear (for example, an exponential relationship), and as a result, they can have a strong Spearman correlation but a weak Pearson correlation. The following example demonstrates this:

x <- seq(1, 10, by=0.2)
y <- exp(x)
plot(x, y, las=1, pch=16)
cor(x, y, method='pearson')
cor(x, y, method='spearman')

Data can display strong relationships that are neither linear nor monotonic (for example, a sine wave). When this happens, the Pearson and the Spearman correlations may both be near zero, even though there may be a very simple relationship underlying the data. Always plot your data first so that you understand what the relationship looks like, and always use the appropriate measure of correlation to describe your data.

x <- seq(1, 10, by=0.2)
y <- cos(x)
plot(x, y, las=1, pch=16)
cor(x, y, method='pearson')
cor(x, y, method='spearman')

Finally, remember that outliers affect the Pearson correlation coefficient more than they affect the Spearman correlation coefficient. In the example below, one outlier (in red) is added to the data. Notice how much more the Pearson correlation increases than the Spearman correlation does. Again, plot your data first to know whether outliers are an issue.

x <- rnorm(10)
y <- rnorm(10)
plot(x, y, pch=16, xlim=c(-3,5), ylim=c(-3,5))
cor(x, y)
cor(x, y, method='spearman')
 
# Add one outlier
x <- c(x, 5)
y <- c(y, 5)
points(x[11], y[11], pch=16, col='red')
cor(x, y)
cor(x, y, method='spearman')