22 October, 2009

R comes with a rich set of probability distributions, and four consistent ways of accessing them: r, d, q, and p. If you combine these with the name of the distribution - rnorm(), dnorm(), qnorm(), and pnorm(), for example - you can generate random numbers, find the value of the function at a specific point, produce a critical value, and calculate a p-value. Let's look at each of these functions, but we'll use the F distribution.

When you want to generate random numbers, use the r prefix, as in rf() for the F distribution. Most distributions require one or more parameters to define the shape of the distribution. For the F distribution, you need the degrees of freedom for the numerator (df1) and the denominator (df2). To generate n=50 random numbers, you would type

rf(n=50, df1=15, df2=20)

If you generate many random numbers and plot a histogram, you can see the approximate shape of that F distribution.

hist(rf(100000, df1=15, df2=20), breaks=100, col="gray", xlab="F", main="")

The d prefix, such as df(), is useful for finding the value of the F function, called its density, at any value of the statistic. In other words, d gives you the height of the probability distribution. Again, for the F distribution, you need to supply the degrees of freedom for the numerator and the denominator, and in this case you need to specify the value of F for which you want the F function. For an F value of 2.3, with 15 and 20 degrees of freedom, the F function is given by:

df(x=2.3, df1=15, df2=20)

You can use this to calculate likelihoods, and you can use it to plot the exact curve for a function.

F <- seq(from=0, to=6, by=0.01)

dF <- df(x=F, df1=15, df2=20)

plot(F, dF, type="l", lwd=3, col="blue")

This is useful if you ever want to visualize the shape of a probability distribution.

The q prefix, as in qf(), is how you obtain critical values from a probability distribution. By default, R will integrate the probability starting from the left tail. If you want to get the probability starting from the right tail, be sure to set lower.tail=FALSE. For an F distribution, you can get a right-tailed critical value by specifying the degrees of freedom, a probability (our alpha value for a critical value), and the lower tail.

alpha = 0.05

qf(p=alpha, df1=15, df2=20, lower.tail=FALSE)

To get two-tailed critical values, you'll need to do them separately and its clearer if you set lower.tail in both cases (TRUE for the lower critical value, and FALSE for the upper critical value).

alpha = 0.05

lowerCriticalValue = qf(p=alpha/2, df1=15, df2=20, lower.tail=TRUE)

upperCriticalValue = qf(p=alpha/2, df1=15, df2=20, lower.tail=FALSE)

Critical values are most commonly needed for generating confidence intervals.

statistic = 3.2

standardError = 1.7

myDf = 22

alpha = 0.05

tcrit = qt(alpha/2, df=myDf, lower.tail=FALSE)

lowerConfidenceLimit = statistic - tcrit * standardError

upperConfidenceLimit = statistic + tcrit * standardError

The p prefix, as in pf(), is how you calculate a p-value from a probability distribution. As for the q prefix, R will always integrate starting with the lower tail. If you want the right tail, specify lower.tail=FALSE. Specify the value of F for which you want a p-value, the degrees of freedom, and whether you want the left tail or the right tail.

observedF = 3.7

pf(q=observedF, , df1=15, df2=20, lower.tail=FALSE)

To get a two-tailed p-value, simply double your probability.

2 * pf(q=observedF, df1=15, df2=20, lower.tail=FALSE)

Notice the symmetry in pf() and qf(): pf() takes q (the statistic) as its first parameter, and qf() takes p (the probability) as its first parameter.