Lecture Notes

Steven Holland

Regression

When you need to know the mathematical form of a relationship between two or more variables, regression is the statistical method for finding that relationship. You might use the equation returned by the regression solely for description. You might also use it for prediction, that is, predicting the dependent (y) variable from the value of the independent (x) variable. You might also want to test hypotheses about the slope and y-intercept. Although correlation and regression are commonly confused, regression describes the mathematical form of the relationship, whereas correlation describes its strength. There are several ways to fit a line to data, and this lecture covers some of the most common.

Model 1: Least-squares regression

Least-squares regression is the most common method for fitting a line, and in many statistical packages, it is the only method offered. It is so common that many scientists are unaware that other methods even exist. The linear model for the relationship between two variables, an independent X variable and a dependent Y variable, is described by the following equation:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

where β1 is the slope and β0 is the y-intercept. The value of any observed Yi equals the slope multiplied by the value of Xi, plus the y-intercept, plus an error term εi. The error term represents all factors besides X that influence the value of Y, including measurement errors and unmeasured variables. Error in X is assumed to be negligible, meaning that you set X and measure the resulting Y; in other words, X is under your direct control, but Y is not. Errors in Y are assumed to be normally distributed about the regression line, an assumption that you must test.

The method of least squares fits a line that minimizes the sum of squares of the residuals, which are the differences between the observed values Yi and the predicted values Ŷi, predicted from Xi and the regression coefficients. For that reason, it is named least-squares regression. Least-squares regression can be used for description, prediction, and hypothesis testing.

The slope of a least-squares regression is calculated as:

b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = r\,\frac{s_Y}{s_X}

Depending on your circumstances, you may wish to calculate it as the sum of products divided by the sum of squares in x, as the covariance divided by the variance in x, or as the correlation coefficient multiplied by the ratio of standard deviations. All are equivalent.

The Y-intercept of a least-squares regression is calculated as:

b_0 = \bar{Y} - b_1 \bar{X}

where Ȳ and X̄ are the means of Y and X; the point (X̄, Ȳ) is known as the centroid. The regression line is constrained to pass through the centroid of the data.
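These formulas are easy to verify numerically. The short sketch below uses two arbitrary illustrative vectors (demoX and demoY are not part of the example that follows), computes the slope all three ways, and then computes the intercept from the centroid.

# arbitrary illustrative data: any paired numeric vectors will do
demoX <- c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7)
demoY <- c(2.0, 4.1, 5.9, 9.7, 10.2, 13.1)

# three equivalent forms of the least-squares slope
b1.sumprod <- sum((demoX-mean(demoX)) * (demoY-mean(demoY))) / sum((demoX-mean(demoX))^2)
b1.cov <- cov(demoX, demoY) / var(demoX)
b1.cor <- cor(demoX, demoY) * sd(demoY) / sd(demoX)
c(b1.sumprod, b1.cov, b1.cor)   # all three give the same slope

# the y-intercept, which forces the line through the centroid
b0 <- mean(demoY) - b1.cov * mean(demoX)
b0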

Everything to this point is descriptive, in that the statistics for slope and intercept are calculated, but no inferences are made about the population. If you wish to make statistical inferences about the parameters (the slope and intercept of the population), there are several questions you could ask, such as whether the slope differs from zero and whether the intercept differs from zero.

In evaluating the regression, we would also want to ask how much of the variation in Y the regression explains and whether its assumptions, such as normally distributed errors, are met.

Regression in R

You can perform a regression in R with the lm() function (think linear model). The lm() function is exceptionally powerful, and for now I will demonstrate only a simple linear regression of two variables. First, I will simulate some data for a line of known slope and intercept, so that we can evaluate the results of the regression. We model a slope of 2.15 and a y-intercept of 4.64, and we generate 30 points conforming to that relationship, where those points have normally distributed errors around that line.

x <- rnorm(n=30, mean=15, sd=3)
errors <- rnorm(n=30, sd=2)
slope <- 2.15
intercept <- 4.64
y <- slope*x + intercept + errors

As always, we start by plotting the data to check that the relationship looks linear and that there are no outliers in either x or y.

plot(x, y, pch=16)

Next, we run the regression with the lm() function. In this example, y is modeled as a linear function of x. Assign the results of the regression to an object so you can fully analyze the results.

myRegression <- lm(y ~ x)

Read the notation lm(y~x) as “a linear model, with y as a function of x”. Other relationships can be modeled besides a simple linear one. For example, if you wanted to model y as a function of x squared, you would write lm(y ~ I(x^2)); the I() is needed because the ^ operator has a special meaning inside a formula. Other relationships follow the same pattern.
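For instance, a quadratic fit to the simulated data could be written either with I() or with the poly() function; this is a minimal sketch, and the object names (quadFit, quadFit2) are just illustrative.

# model y as a function of x and x^2; I() protects ^ from its formula meaning
quadFit <- lm(y ~ x + I(x^2))

# an equivalent fit using polynomial terms (raw=TRUE reports the same coefficients as I())
quadFit2 <- lm(y ~ poly(x, 2, raw=TRUE))
summary(quadFit)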

The myRegression object contains the statistics for the y-intercept (labeled Intercept) and the slope (labeled x here; in general, the slope is labeled with the name of the predictor variable).

> myRegression
 
Call:
lm(formula = y ~ x)
 
Coefficients:
(Intercept)        x
      6.914    2.068

Use the names() function to view everything contained in the myRegression object. You can access any of these by name or by position in the list. For example, to retrieve the residuals, you could use myRegression$residuals or myRegression[[2]]. Calling by name is safer and more self-explanatory, with the only drawback being that it is somewhat more to type.

> names(myRegression)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

Hypothesis tests

You can perform hypothesis tests on the regression with the summary() function.

> summary(myRegression)
 
Call:
lm(formula = y ~ x)
 
Residuals:
    Min       1Q   Median       3Q      Max
-4.6106  -1.0806   0.3249   1.0416   4.2920
 
Coefficients:
           Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  6.9137       2.2027     3.139    0.00397 **
x            2.0676       0.1404    14.726   1.03e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 1.867 on 28 degrees of freedom
Multiple R-squared: 0.8856, Adjusted R-squared: 0.8816
F-statistic: 216.8 on 1 and 28 DF, p-value: 1.034e-14

There’s a lot of information in the output, so let’s step through this. First, the function call is displayed, so you know what was calculated.

Next shown are the residuals, and here you can evaluate the assumption that the residuals are normally distributed. Because the errors are assumed to be centered about the regression line, the median residual should be close to zero. The errors are assumed to be normally distributed, so the first quartile (1Q) and third quartile (3Q) residuals should be about equally distant from zero. Likewise, the minimum (Min) and maximum (Max) residuals should be roughly equally distant from zero. In this case, everything looks roughly balanced, although the median of 0.3 might be a warning sign that the residuals are skewed.

The coefficients of the regression line are shown next, along with tests of the null hypotheses that the slope equals zero and that the intercept equals zero. The estimates of the intercept and slope are shown in the second column (here, 6.9137 and 2.0676). Labeling the slope as x may seem confusing at first, but in a multiple regression, where you regress one dependent variable (y) against several independent variables, such as Iron, Chloride, and Phosphorus, the slope for each independent variable appears on its own line, labeled with the name of that variable.

The estimates of the intercept and the slope follow a t-distribution, provided that sampling is random and the errors are normally distributed. To run a t-test against the null hypotheses of zero intercept and zero slope, divide each estimate by its standard error (third column) to produce a t-statistic (fourth column). The corresponding p-values are shown in the last column. For convenience, these are flagged with symbols indicating their statistical significance: *** for p < 0.001, ** for p < 0.01, * for p < 0.05, . for p < 0.1, and no symbol for p-values greater than 0.1.
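The t-statistic and p-value for the slope can be checked longhand from the coefficients table stored in the summary; this is a sketch, and the object names (regSummary, est, se, tStat, pVal) are just illustrative.

regSummary <- summary(myRegression)
est <- regSummary$coefficients["x", "Estimate"]
se <- regSummary$coefficients["x", "Std. Error"]
tStat <- est / se                                          # t-statistic for the slope
pVal <- 2 * pt(-abs(tStat), df=myRegression$df.residual)   # two-tailed p-value
c(tStat, pVal)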

The residual standard error is shown on the next line. It is the square root of the ratio of the residual sum of squares to its degrees of freedom (n−2), and it measures the amount of scatter around the regression line. You can compare it to the standard deviation of y to get a sense of how much of the variation in y is not accounted for by the regression.
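As a check, the residual standard error can be computed directly from the residuals of the simulated example above.

n <- length(y)
sqrt(sum(residuals(myRegression)^2) / (n - 2))   # matches the reported residual standard error
sd(y)                                            # total scatter in y, for comparison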

The R-squared value indicates the proportion of explained variation, that is, the variation in y that is explained by the variation in x. It is calculated as the square of the correlation coefficient. If all of the y-values fell exactly on the regression line, x would be a perfect predictor of y and it would be said to explain all of the variation in y. If the points formed a nearly circular shotgun blast with a correlation coefficient near zero, the proportion of explained variation (R2) would be very close to zero, indicating that the regression explains almost nothing about the variation in y. R2 is an important statistic and should always be reported; it is the best single number to describe the strength of a linear relationship. The Adjusted R-squared reflects a small penalty correction for the number of included parameters (independent variables). As you add more independent variables, Adjusted R-squared will become progressively smaller than the R2 value.
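In a simple linear regression, you can confirm that R-squared is the square of the correlation coefficient:

cor(x, y)^2                       # squared correlation coefficient
summary(myRegression)$r.squared   # R-squared reported by the regression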

The test of the significance of the regression itself is shown on the last line as an F-test, a ratio of variances. It tests whether the variance accounted for by the regression is greater than the variance of the residuals; if the regression explained none of the variation in Y, these two variances would be equal. The low p-value here tells you that the regression is highly significant: it explains much of the variance in y. The degrees of freedom are 1 for the numerator (the regression) and 28 (n−2) for the denominator. Bear in mind that this F-test tells you only whether there is a statistically significant result. Like all p-values, it is strongly influenced by sample size and says nothing about the effect size or our uncertainty in it. To evaluate effect size, examine the size of the slope and use its standard error to calculate a confidence interval.

It is straightforward to get confidence intervals on the slope and intercept. Because you know the standard errors, you could also do this longhand.

confint(myRegression)
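Because the summary stores the standard errors and the regression object stores the degrees of freedom, the longhand calculation for the slope is short; this sketch reproduces the 95% interval given by confint(), with est, se, and degf as illustrative names.

est <- summary(myRegression)$coefficients["x", "Estimate"]
se <- summary(myRegression)$coefficients["x", "Std. Error"]
degf <- myRegression$df.residual
est + c(-1, 1) * qt(0.975, degf) * se   # matches confint(myRegression)["x", ]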

Evaluating the assumptions

Before using the results, you should evaluate several aspects of the regression, which you can do with the plot() command; it presents four useful diagnostic plots, one at a time:

plot(myRegression)

The first plot shows residuals (ε) against the fitted values (Ŷ). Two aspects are critical. First, the residuals should be centered around zero (shown by the gray dashed line) for all fitted values; the red line showing the trend of the residuals should be close to the gray dashed line. Second, the variance of the residuals should not change with fitted values: the cloud of points should not widen to the left or to the right.

Three points marking the three largest residuals are identified by number, which corresponds to the row of the data frame. Watch for these numbered points in all four plots to identify outliers. Some of them might reflect data recording or acquisition errors, such as an erratic reading from an instrument. Points that reflect such errors need to be corrected or removed from the regression, but never remove points simply because doing so would clean up the regression.

The second plot is a qqnorm() plot. The residuals are assumed to be normally distributed, so they should fall fairly close to the gray dashed line on this plot, a line that indicates perfect agreement between the theoretical and observed quantiles. This match should be especially good in the middle part of the range, as it is common for points near the ends of the range to lie off the dashed line. Strong systematic departures indicate non-normally distributed errors.

The third plot focuses on trends in the sizes of the residuals. The square root of the absolute value of the residuals is shown, which allows one to see trends in the residuals with the fitted values. The residuals should show no trend (i.e., the red line should be relatively flat), and the residuals should not form a triangular region, with an increasing range in the size of residuals as the fitted values increase.

The fourth plot is the standardized residuals versus leverage, giving insight as to which data points have the greatest influence on the estimates of slope and intercept. Note that this is a different plot than indicated in the Crawley text (p. 144), but the goal is the same. Standardized residuals should show no trend with leverage (i.e., the solid red line should lie close to the dotted gray line). Furthermore, no points should have large values of Cook’s distance, which would indicate that an outlier (a point with a large residual) also has strong leverage (ability to pull the regression line in some direction).
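By default, plot() shows these diagnostics one at a time, pausing between them; if you prefer, you can display all four in a single window.

par(mfrow=c(2, 2))    # 2 x 2 grid of plots
plot(myRegression)
par(mfrow=c(1, 1))    # restore the single-plot layout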

Adding the regression to a plot

With the data plotted, you can add the regression line with the abline() function. You can style the line as you normally would.

plot(x, y, pch=16)
abline(myRegression, col='red', lwd=2)

Prediction using the regression

In some cases, you would like to use the regression for prediction, that is, you know the independent (predictor) variable and you would like to know the value of the dependent (result) variable. To do this, you need to supply the regression model and a list of values of the independent variable for which you would like predictions.

predict(myRegression, list(x=c(12.2, 13.7, 14.45)))
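The predict() function can also report uncertainty on these predictions through its interval argument: interval='confidence' gives a confidence interval on the mean of y at each x, and interval='prediction' gives the wider interval for a new individual observation of y. The name newX below is just illustrative.

newX <- list(x=c(12.2, 13.7, 14.45))
predict(myRegression, newX, interval='confidence')   # uncertainty in the mean of y
predict(myRegression, newX, interval='prediction')   # uncertainty in a new observation of y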

Model 2 regression

In a model 1 regression, you control one variable (x) and measure the response variable (y); lab experiments are examples of this. In other situations, you do not control either variable, and it is often unclear which variable should be treated as x and which as y. The choice matters, because a regression of y on x produces a different line than a regression of x on y. For example, if you collected rocks and made two different measurements of those rocks, you would not control either value; you could only measure what is there. When both variables are measured rather than controlled, both contain error, and you should perform a model 2 regression. A model 2 regression allows you to describe the relationship and to test some hypotheses, but it cannot be used for prediction.

A model 2 regression accounts for the uncertainty in both x and y by minimizing the errors in both directions. There are several ways to do this. In a major axis regression, what is minimized is the perpendicular distance from a point to the line. In standard major axis (SMA) regression (also called reduced major axis or RMA regression), the areas of the triangles formed by the observations and the regression line are minimized. The standard major axis regression is particularly common. The slope of a SMA regression is:

b_1 = \pm\frac{s_Y}{s_X} = \pm\sqrt{\frac{\sum_i (Y_i - \bar{Y})^2}{\sum_i (X_i - \bar{X})^2}}

The sign is listed as plus or minus because it is set to match the sign of the correlation coefficient. The slope can be calculated as the ratio of the standard deviations or as the square root of the ratio of the sums of squares, whichever is more convenient.

The SMA y-intercept is calculated as it is for the least-squares regression, that is, the line must pass through the centroid.

Functions for the SMA slope and intercept are straightforward.

smaSlope <- function(x,y) {
   b1 <- sd(y)/sd(x)
   if (cor(x,y)<0) b1 <- -b1 # change sign for negative correlation
   b1
}
 
smaIntercept <- function(x,y) {
   b1 <- sd(y)/sd(x)
   if (cor(x,y)<0) b1 <- -b1 # change sign for negative correlation
   b0 <- mean(y) - mean(x)*b1
   b0
}
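Applied to the simulated data from above, these functions can be compared directly with the least-squares coefficients.

smaSlope(x, y)
smaIntercept(x, y)
coef(myRegression)   # least-squares intercept and slope, for comparison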

The SMA slope equals the least-squares slope divided by the correlation coefficient and is therefore always steeper than the least-squares slope. The difference between the two slopes shrinks as the correlation becomes stronger. As the correlation between the two variables weakens, the least-squares slope approaches zero, whereas the SMA slope stays at ±sY/sX, because it does not depend on the strength of the correlation.

Standard errors are available for the SMA slope and intercept (Kermack and Haldane 1950, Miller and Kahn 1962, and see acknowledgements below). From these, you can calculate confidence intervals on the slope and intercept, using n-2 degrees of freedom.

[equation: standard error of the SMA slope]

[equation: standard error of the SMA intercept]
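One commonly cited form of the SMA slope's standard error is b1·sqrt((1 − r²)/n); assuming that form (the equation shown in the original notes may differ), a confidence interval on the slope could be sketched as follows, using n−2 degrees of freedom.

b1 <- smaSlope(x, y)
n <- length(x)
seSlope <- abs(b1) * sqrt((1 - cor(x, y)^2) / n)   # assumed standard error of the SMA slope
b1 + c(-1, 1) * qt(0.975, df=n-2) * seSlope        # approximate 95% confidence interval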

The lmodel2 package can run a variety of model 2 regressions, plot them, and perform statistical tests. After loading that library, running vignette('mod2user') will display an outstanding pdf on best practices, particularly the appropriate circumstances for each type of model 2 regression. If you think you might need a model 2 regression, read this pdf.
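A minimal sketch, assuming the lmodel2 package is installed (the object name model2fit is just illustrative):

library(lmodel2)
model2fit <- lmodel2(y ~ x, nperm=99)   # fits OLS, MA, and SMA regressions; nperm sets the permutation tests
model2fit                               # slopes, intercepts, and confidence intervals for each method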

Adding a trend line to a plot

In some cases, you may wish to add a trend line through the data solely to describe any trends, which may not fit any prescribed function.

You can see how these work with a plot of temperature versus day of the year, which shows a sinusoidal relationship. For this, use the five.yr.temperature data set in the UsingR package.

library(UsingR)
attach(five.yr.temperature)

The function scatter.smooth() plots the data and the trend line in one step. The col parameter sets the color of the data points.

scatter.smooth(temps ~ days, col='gray')

There are two ways you can add a trend line to an existing plot. The first uses the smooth.spline() function to calculate the trend line and the lines() function to add it to the plot. You can use the lty parameter to dash the trend line and the lwd parameter to make it thicker.

plot(days, temps, col='gray')
lines(smooth.spline(temps ~ days), lty=2, lwd=2)

You can use the supsmu() function to make a Friedman’s SuperSmoother trend line, and again display it with the lines() function. Note that the syntax for calling supsmu() differs from that of smooth.spline() or performing a regression.

plot(days, temps, col='gray')
lines(supsmu(days, temps), lty=3, lwd=2)

You can show all three trend lines on one plot for comparison. The default scatter.smooth() function produces the smoothest trend line of the three. You can control the smoothness of any of these trend lines by adjusting one of their parameters. See the help pages for more details, in particular the span parameter for scatter.smooth(), the spar parameter for smooth.spline(), and the bass parameter for supsmu().

scatter.smooth(temps ~ days, col='gray')
lines(supsmu(days, temps), lty=3, lwd=2)
lines(smooth.spline(temps ~ days), lty=2, lwd=2)
legend(locator(1), lty=c(1,2,3), lwd=c(1,2,2), legend=c('scatter.smooth', 'smooth.spline', 'supsmu'))

The last line of this code adds a key with the legend() function. In it, you specify the types of lines (lty), their weights (lwd), and their labels (legend). The locator(1) function lets you click where you want the upper left corner of the legend box to be placed on your plot. You could instead specify the (X, Y) coordinates of that corner; see the help page for legend() for instructions. The legend() function can be used on any type of plot and with any combination of points and lines.

Acknowledgements

David Goodwin kindly brought to my attention an error in Davis (1986, 2002) in the equation for the standard error of the intercept of a SMA regression. Davis did not include the final term in the radicand, but it should be included. The correct equation is given in Kermack and Haldane (1950) and Miller and Kahn (1962), both cited by Davis. The correctness of this equation over Davis’ version can be demonstrated with a bootstrap.

References

Davis, J.C., 1986. Statistics and Data Analysis in Geology, 2nd edition. Wiley: New York.

Davis, J.C., 2002. Statistics and Data Analysis in Geology, 3rd edition. Wiley: New York, 638 p.

Kermack, K.A., and J.B.S. Haldane, 1950. Organic correlation and allometry. Biometrika 37:30–41.

Miller, R.L., and J.S. Kahn, 1962. Statistical Analysis in the Geological Sciences. Wiley: New York, 483 p.