Steven Holland

# Generalized Linear Models

Linear models, as produced by the lm() function, make several critical assumptions: normally distributed errors, constant variance, and no trends in residuals relative to fitted values. When these assumptions will be violated, you must use a different modeling procedure, such as Generalized Linear Models. Generalized linear models are implemented in R with the glm() function, which has the same syntax for model description as lm(). Generalized linear models use likelihood methods and are fundamentally different in their approach to regression than least-squares regression.

Generalized linear models have three important components: the error structure, the linear predictor, and the link function.

The error structure describes how the errors are distributed. Count data, represented as integers with a minimum value of zero, have Poisson-distributed errors. Proportions, which range from zero to one, have binomial-distributed errors. Data with a constant coefficient of variation, where standard deviation is a multiple of the mean, have gamma-distributed errors. Data on time-to-death, such as survival analysis, have errors that follow an exponential distribution.

The linear predictor is the sum of the parameters (slopes), each multiplied by the values of the explanatory variables. It is formed by the right side of the linear model formula.

The link function is a function that relates the mean value of y to the linear predictor and may be the most confusing aspect of a generalized linear model. Every error structure has an associated link function. The normal distribution has a link function called the identity function, which means that the y-value is equal to the linear predictor. In other words, nothing must be done to relate the y-value to the linear predictor. In glm(), the family parameter specifies both the error distribution and link function. For example, to set a gaussian error model and an identity link function, just set the family to be gaussian

glm(y ~ x, family=gaussian)

This will produce the same results as if you had called lm(), although glm() calculates the regression coefficient with likelihood methods, not by minimizing the sum of squares as lm() does.

Here are the most common settings for family in glm() and their associated link functions. Notice that only Gamma is capitalized.

## Count data

Count data have several important properties. First, count data have a minimum value of zero and are therefore said to be bounded below. Count data are integers, which affects their error distribution. Last, count data follow a Poisson distribution in which variance is not constant, but increases with the mean.

A generalized linear model for count data requires the poisson family and would therefore be written as

glm(y~x, family=poisson)

This model is equivalent to the equation

which can be rewritten as

To fit a generalized linear model to count data, put the counts in one vector and the explanatory vector in another. Use the predict() function to generate points for the function curve to be plotted. The predict() function can be used with all types of regression to display the regression curve on a plot.

# simulating data
speciesA <- c(25,23,20,19,12,10,7,10,4,9,7,8,3,1,5,2,0,1,1,0)
habitat <- c(14,16,17,18,18,18,18,20,20,20,21,21,21,21,22,22,23,23,24,25)
plot(habitat, speciesA, las=1, pch=16, col='green')

# fitting a model
model <- glm(speciesA~habitat, family=poisson)

# adding fitted model to data plot
habitatPoints <- seq(min(habitat), max(habitat), by=0.01)
fittedValues <- predict(model, list(habitat=habitatPoints), type='response')
lines(habitatPoints, fittedValues, lwd=3, col='blue')

## Proportion and binary data

Proportion data have three important properties. Values are constrained to lie between 0 and 1, inclusive, and are therefore said to be strictly bounded. The errors are non-normally distributed and follow a binomial distribution. Finally, variance is not constant, but reaches a minimum when the proportions are near zero and one and reaches a maximum when the proportion is 0.5.

Proportion data and binary data require the binomial family, which uses a logit link function. The logit function is equal to log(p/(1-p)), also called the log-odds, where p is the proportion. The model for proportion or binary data is stated as

glm(p~x, family=binomial)

This model specifies the relationship

which can be rewritten in terms of p as:

The first example will be for proportional data. To preserve sample size, the data for calculating the proportions should be stored in two vectors. Because this will model the proportion of a community represented by one species, the abundance of that species in recorded one vector (speciesA), and the abundance of all other species combined are stored in another vector (otherSpecies). They are combined into a matrix with the cbind() command, which will be used as the response variable in the generalized linear model.

# simulating data
speciesA <- c(25,23,20,19,12,10,7,10,4,9,7,8,3,1,5,2,0,1,1,0)
otherSpecies <- c(15,20,35,25,42,25,34,28,38,30,35,34,36,40,45,37,41,43,40,46)
proportion <- speciesA/(otherSpecies+speciesA)
y <- cbind(speciesA, otherSpecies)
habitat <- sort(rnorm(n=20, mean=20, sd=3))
plot(habitat, proportion, las=1, pch=16, col='green')

# fitting a model
model <- glm(y~habitat, family=binomial)

# adding fitted model to data plot
habitatPoints <- seq(min(habitat), max(habitat), by=0.01)
fittedValues <- predict(model, list(habitat=habitatPoints), type='response')
lines(habitatPoints, fittedValues, lwd=3, col='blue')

Alternatively, you could put the proportions in a vector, and the sample sizes in another vector. Using the proportions as the response variable in the model formula, the weights parameter would be set equal to sample size.

proportion <- speciesA/(otherSpecies+speciesA)
sampleSize <- speciesA + otherSpecies
habitat <- sort(rnorm(n=20, mean=20, sd=3))
model <- glm(proportion~habitat, family=binomial, weights=sampleSize)

The second example will illustrate binary (presence/absence) data, such as the occurrence of a species along a habitat transect, which are recorded as 1’s (present) and 0’s (absent).

# simulating data
present <- c(1,1,1,1,1,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0)
habitat <- sort(rnorm(n=22, mean=20, sd=3))
plot(habitat, present, las=1, pch=16, col='green')

# fitting a model
model <- glm(present~habitat, family=binomial)

# adding fitted model to data plot
habitatPoints <- seq(min(habitat), max(habitat), by=0.01)
fittedValues <- predict(model, list(habitat=habitatPoints), type='response')
lines(habitatPoints, fittedValues, lwd=3, col='blue')

## Survival data

Survival data, also called time-to-death and time-to-failure data, have variance that is nearly always non-constant. Where these data have errors that follow a gamma distribution, variance is proportional to the square of the mean, that is variance increases more rapidly than the mean. Gamma errors require a reciprocal link function (1/y), and their model is written as:

glm(y~x, family=Gamma)

This model specifies the relationship

or, alternatively

Survival data can be fitted with the time-to-death data in one vector and the explanatory vectors in another. The function calls and plotting are similar to those above, but with family=Gamma.

## References

Dalgaard, P. 2002. Introductory statistics with R. Springer-Verlag, New York.

Sokal, R. R., and F. J. Rohlf. 1995. Biometry. W.H. Freeman and Company, New York.

Verzani, J. 2005. Using R for introductory statistics. Chapman & Hall / CRC Press, Boca Raton, Florida.