Multivariate data should be set up in a data frame, with variables in columns and cases in rows. There should be no missing values: every variable should have a value for every case.
As with any data set, the first step in analyzing multivariate data is to plot the data. Strip charts and histograms remain the standard tools for plotting distributions. Bivariate plots are useful for assessing correlations of variables, but the number of such comparisons grows rapidly with the number of variables. The pairs() command is an efficient way to generate all possible pairwise plots. The data points may be small and compressed, but you can readily assess relationships among the variables. Setting the gap parameter to zero removes the space between plots. Shrinking the size of the labels by setting the cex.labels parameter to less than 1 can also help with crowding.
pairs(myDataFrame, gap=0, cex.labels=0.9)
Where multivariate data consist of one dependent and many independent variables, developing a linear model should be a priority. Linear models may include continuous variables and categorical variables (factors) in any combination. Slopes for continuous variables are intuitive, but the slope for factors is the difference between the group means.
Despite the name, linear models do not necessarily produce a straight line. For example, the equation y = ax + bx2 + cex is a linear model even though it is not a line because new variables could substitute for x2 and ex such that y = ax + bw + cz. Such a model is said to be linear in its parameters and linear in its random variables.
The function lm() is the standard tool in R for linear modeling, and it is extremely flexible. The first parameter to the lm() function is the model, and the simplest model is y ~ x. In this notation, the dependent variable is always to the left of the tilde (~), and the independent variable is on the right side. The model y ~ x is read “y as a function of x”.
More complicated models can made by specifying additional independent variables. The syntax of the model specification can initially be confusing because it uses the symbols of arithmetic, but without their arithmetic meanings. For example, an additional variable is added to the model by using the + operator: y ~ x + z. In this context, + does not indicate addition, but means include an explanatory variable. With the + operator, you can include many explanatory variables in your model: y ~ x + z + w + q.
In some situations, there may be interactions between explanatory variables. For example, the effect of z may be different at high values of x than at low values. Such interactions can be included in two ways. The simplest is to use the : operator, as in y ~ x + z + x:z, which reads as y as a function of x, z, and the interaction of x and z. Equivalently, you could also use the * operator: y ~ x * z. The * operator does not mean multiplication in this context, but means include an explanatory variable and all of its interactions. When there are more than two explanatory variables, the combination of all possible interactions may be quite large. For example y ~ x * z * w is the same as y ~ x + z + w + x:z + x:w + z:w + x:z:w. The ^ operator can be used to find all interactions up to a given level of complexity. For example, (x + z + w)^2 will include all interactions involving up to two explanatory variables at a time, or x + z + w + x:z + x:w + z:w. To include interactions with up to three explanatory variables, you would write (x + z + w)^3, which would expand to x + z + w + x:z + x:w + z:w + x:z:w, the same as x*z*w.
Models can be nested. If you want to model y as a function of x and z within x, use the / operator: y ~ x / z. In this context, / indicates nesting, not division.
Models may be conditional, by using the | operator. For example y ~ x | z is read as y as a function of x given z.
During model simplification, discussed below, the - operator is used to identify a term that should be removed. Again, - does not mean subtraction, but to remove an explanatory variable. Regressions may be forced through the origin by adding -1 to the model. The model y ~ 1 finds the grand mean of y, so appending -1 to a model removes this term and forces the regression through the origin.
Polynomial regression can be accomplished through the poly() function, as in y ~ poly(x,3), which will fit a cubic polynomial in x, in other words, y = b0 + b1x + b2x2 + b3x3.
Because the arithmetic operators have special meaning when used as specifying a linear model, the identity function I() must be used when these operators are used in their arithmetic sense. For example, if you wanted to find the parameters for the equation y = a + bx + cx2, you would write the linear model as y ~ x + I(x^2). The I() function tells R to treat what is in the parentheses (x^2) as the arithmetic statement x2. Similarly, if you wanted to fit the equation y = a + b/x, you would state the linear model as y ~ I(1/x).
As the number of explanatory variables increases, the number of possible linear models rises dramatically, particularly when you consider interaction and higher-order terms. The goal of linear modeling is to find a model that is relatively simple yet explains as much variation as possible.
Finding an appropriate model for a given data set takes time, and several things must be kept in mind. First, there is no single model for explaining a given set of data. Second, different strategies for choosing a model may lead to different results. Third, the order in which variables are added or removed can also affect the final result. In all these cases, your understanding of the science must guide your decisions. For example, if two approaches lead to different models, choose the one that makes more sense in terms of the underlying processes.
The simplest approach to finding the best model is to try all possible regressions and compare them with some metric. You could choose the model with the largest coefficient of multiple determination (R2), the smallest mean square error term, or the largest adjusted R2, which accounts for the number of parameters in the model.
More commonly, the best model is found by either starting with a simple model and adding more predictor variables, called forward selection, or by starting with the most complicated model and removing predictor variables, called backward elimination.
In forward selection, predictor variables are added until there is no substantial increase in the coefficient of determination (R2). The principal problem is determining the order in which to add variables, with one intuitive approach being to add the variable that produces the greatest increase in R2, which is usually determined by calculating partial correlation coefficients. A second approach is by calculating the F-statistic corresponding to the addition of a new variable (called F-in or F-to-enter), adding the variable that produces the largest F-statistic, provided the F-statistic meets some minimum requirement of significance. Variables are no longer added when none of the remaining variables produces a significant F-statistic, in other words, when adding another explanatory variable does not substantially increase explained variance.
In backward elimination, all predictor variables are initially included in the model. Predictor variables are dropped sequentially, provided they do not substantially lower the coefficient of determination (R2). This is typically done by removing the variable with the smallest partial correlation coefficient or by removing the variable with the smallest F-statistic (called F-out or F-to-remove), provided that F is not significant. A general approach is to start removing the highest-order interaction terms first, usually those that are the least significant.
The problem with both of these approaches is that all predictor values are kept even though a variable may add little to R2 once other variables have been added via forward selection. Likewise a variable may not be included in the backward elimination model, even though the removal of other variables may now make it contribute substantially to R2. Stepwise regression solves this problem. In forward selection, the list of included variables is re-evaluated after a predictor variable has been added to see if one of the included variables no longer adds substantially to R2. After each variable is removed in backward elimination, the list of removed variables is scanned to see if any of those variables would now add substantially to R2.
In both forward selection and backward elimination, the models produced are nested in that the simpler model always contains a subset of the parameters of the more complicated model. Nested models can be compared in R with the anova() function. The idea is that the difference between the two models has a quantity called the extra sum of squares, equal to the reduction in the unexplained sum of squares produced by the additional model terms. This can be converted to a variance, which can be scaled against the unexplained variance term for the full model. A ratio of variances can be tested with an F-test, hence this problem can be solved with an ANOVA. Testing for the significance of additional variables between two models is done like this:
model1 <- lm(y~x)
model2 <- lm(y~x+z)
A significant result in the ANOVA indicates that the additional parameters significantly decrease the unexplained variation and should be included in the model. In forward selection, you would include the parameters of the more complicated model if they did produce a statistically significant result in the ANOVA. In backward elimination, you would remove the parameters of the more complicated model if they did not generate a statistically significant result in the ANOVA.
Model simplification can be time-consuming, with difficult and subjective choices along the way. The step() function can be used to automate the process, based on Akaike’s Information Criterion (AIC).
Non-nested models can also be tested with AIC, using the AIC() and stepAIC() functions. The latter is particularly useful because it can automate the entire process. Start by constructing the most complicated model, then call stepAIC() on the results of that model to see how it could be simplified by using AIC. Both functions require the MASS library.
mostComplicatedModel <- lm(y ~ x * z * w)
When fitting linear models, watch for multicollinearity among the independent variables, that is, where independent variables are highly correlated. Often, this may indicate that two or more variables measure the same quantity. Including highly correlated independent variables can make model selection more difficult and can complicate the interpretations of regression coefficients. When you have multiple highly correlated variables, it is often best to use only one of them, usually the one that is most strongly correlated with the dependent variable.
Once a model has been selected, it must be evaluated through the various diagnostic plots available in plot.lm(). Three aspects of the regression must be checked. First, verify that the errors are normally distributed, usually with the qqnorm() plot of the residuals. Second, verify that the residuals do not change systematically with the fitted values. Third, make sure that there are no points that have an unduly large influence over the regression, as measured by Cook’s distance. Cook’s distance is measured as the square of the difference in the slope measured with all of the points, relative to the slope measured without the point in question. Points with large values of Cook’s distance may indicate measurement errors or some other problem.
Dalgaard, P. 2002. Introductory statistics with R. Springer-Verlag, New York.
Sokal, R. R., and F. J. Rohlf. 1995. Biometry. W.H. Freeman and Company, New York.
Verzani, J. 2005. Using R for introductory statistics. Chapman & Hall / CRC Press, Boca Raton, Florida.