Steven Holland

# Estimating Diversity

14 November 2015

We often want to know how many species are present, whether it is in a particular habitat, a province, or an interval of geological time. What sounds like a simple matter of counting species becomes complicated because the number of species increases with the number of individuals that are counted.

This becomes a problem when comparing the diversity (richness) of different samples and those samples have different numbers of counted individuals. A common approach is to use rarefaction (Hurlbert 1971) to subsample all of the samples and calculate their richness at the same number of individuals.

This also sounds straightforward, but it becomes complicated by evenness, that is, how abundance decreases from the most abundant species to the least abundant. When evenness is high, all species have relatively similar abundances, and when evenness is low, the abundances of species differs greatly. The problem for rarefaction is that rarefaction curves can cross when two samples differ in their evenness, and therefore the relative richness of samples will change at different sampling levels.

The second problem with rarefaction is what constitutes fair sampling. Although it may sound fair to count all samples to the same number of individuals, this can be misleading. An alternative approach is to count individuals until a given level of the relative abundance distribution has been sampled, for example, sampling might continue until the combined relative abundances of the sampled species equals some specified value. In effect, this means that samples with low evenness will need to be counted to a larger number of individuals to achieve the same coverage of relative abundance. John Alroy’s Shareholder Quorum Subsampling (SQS) performs this fair sampling (Alroy 2010).

Below, I give an implementation of SQS, based on John’s code. The implementation here removes the rarefaction option and the calculation of other metrics not needed for SQS. It also more closely follows R’s conventions of variable naming, returns and error handling, and avoidance of loops. I’ve also added more comments so that the steps in the calculations are clearer. The code retains one large loop, which is decidedly not R-like, but using this loop is simpler than the alternatives, given the large number of required parameters. The code is also available as a download.

sqs <-function(abundance, quota=0.9, trials=100, ignore.singletons=FALSE, exclude.dominant=FALSE) {
# abundance is a vector of integers representing the abundance of every species

if ((quota <= 0 || quota >= 1)) {
stop('The SQS quota must be greater than 0.0 and less than 1.0')
}

# compute basic statistics
specimens <- sum(abundance)
numTaxa <- length(abundance)
singletons <- sum(abundance==1)
doubletons <- sum(abundance==2)
highest <- max(abundance)
mostFrequent <- which(abundance==highest)[1]

if (exclude.dominant == FALSE) {
highest <- 0
mostFrequent <- 0
}

# compute Good's u
u <- 0
if (exclude.dominant == TRUE) {
u <- 1 - singletons / (specimens - highest)
} else {
u <- 1 - singletons / specimens
}

if (u == 0) {
stop('Coverage is zero because all taxa are singletons')
}

# re-compute taxon frequencies for SQS
frequencyInitial <- abundance - (singletons + doubletons / 2) / numTaxa
frequency <- frequencyInitial / (specimens - highest)

# return if the quorum target is higher than estimated coverage
if ((quota > sum(frequency)) || (quota >= sum(abundance))) {
stop('SQS quota is too large, relative to the estimated coverage')
}

# create a vector, length equal to total number of specimens,
# each value is the index of that species in the abundance array
ids <- unlist(mapply(rep, 1:numTaxa, abundance))

# subsampling trial loop
richness <- rep(0, trials) # subsampled taxon richness
for (trial in 1:trials) {
pool <- ids # pool from which specimens will be sampled
specimensRemaining <- length(pool) # number of specimens remaining to be sampled
seen <- rep(0, numTaxa) # keeps track of whether taxa have been sampled
subsampledFrequency <- rep(0, numTaxa) # subsampled frequencies of the taxa
coverage <- 0

while (coverage < quota) {
# draw a specimen
drawnSpecimen <- sample(1:specimensRemaining, size=1)
drawnTaxon <- pool[drawnSpecimen]

# increment frequency for this taxon
subsampledFrequency[drawnTaxon] <- subsampledFrequency[drawnTaxon] + 1

# if taxon has not yet been found, increment the coverage
if (seen[drawnTaxon] == 0) {
if (drawnTaxon != mostFrequent && (ignore.singletons == 0 || abundance[drawnTaxon] > 1)) {
coverage <- coverage + frequency[drawnTaxon]
}
seen[drawnTaxon] <- 1

# increment the richness if the quota hasn't been exceeded,
# and randomly throw back some draws that put the coverage over quota
if (coverage < quota || runif(1) <= frequency[drawnTaxon]) {
richness[trial] <- richness[trial] + 1
} else {
subsampledFrequency[drawnTaxon] <- subsampledFrequency[drawnTaxon] - 1
}
}

# decrease pool of specimens not yet drawn
pool[drawnSpecimen] <- pool[specimensRemaining]
specimensRemaining <- specimensRemaining - 1
}
}

# compute subsampled richness
s2 <- richness[richness>0]
subsampledRichness <- exp(mean(log(s2))) * length(s2)/length(richness)
return(round(subsampledRichness, 1))
}

## Using SQS

Using the SQS function is simple: just supply it a vector of species abundances, in any order.

abundances <- c(2, 1, 12, 2, 5, 3, 25, 2, 1, 17, 1, 9)
sqs(abundances)

>## [1] 8.5

The quota argument is set to 0.9 by default, which means that sampling will continue until taxa whose combined relative abundance exceeds 90%. This quota could be set to lower levels. Although John does not provide guidelines as to what the quota (quorum) should be, he does argue that the final results are not dependent on its value, unlike rarefaction (Alroy 2010, p. 1217). In his examples, John sets the quota to 40% and 55%.

The number of trials will control the precision of the answer, with the tradeoff being computation time. By default, 100 trials are performed. This value could be increased, but shouldn’t be decreased except when testing the method.

The remaining two parameters control whether the method includes singletons (species occurring only once in a sample) and dominants (the most abundant taxon). John argues (Alroy 2010, p. 1217) that these are special cases that should rarely effect the result. One might set ignore.singletons to true if one sample was unusually intensely sampled compared to other collections. One might set exclude.dominant to true to avoid having dominance affect the size of the species pool. John mentions that both corrections are relevant in idiosyncratic and small data sets and that they decrease sampling error in larger data sets.

## References

Alroy, J. 2010. Geographical, environmental and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology 53:1211–1235.

Hurlbert, S.H. 1971. The nonconcept of species diversity: a critique and alternative parameters. Ecology 52:577–586.