Cluster analysis includes two classes of techniques designed to find groups of similar items within a data set. Partitioning methods divide the data set into a number of groups pre-designated by the user. Hierarchical cluster methods produce a hierarchy of clusters, ranging from small clusters of very similar items to larger clusters of increasingly dissimilar items. This lecture will focus on hierarchical methods.
Partitioning methods are best applied when a specific number of clusters in the data is hypothesized, such as when morphological data on specimens are thought to represent a known number of species. One of the most well-known partitioning methods is k-means clustering, which partitions n samples into k clusters. K-means clustering tends to produce clusters of roughly equal size, which may not be desirable in some situations.
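As a minimal sketch of k-means partitioning, the code below uses base R's kmeans() function on simulated data; the data and the choice of k = 2 are purely illustrative, and in practice k would come from a hypothesis about the data.

```r
# Sketch: k-means partitioning with base R's kmeans().
# Two simulated groups of points; k is chosen by the user in advance.
set.seed(42)
x <- rbind(matrix(rnorm(50, mean=0), ncol=2),
           matrix(rnorm(50, mean=4), ncol=2))
km <- kmeans(x, centers=2)
km$cluster   # cluster assignment for each sample
km$size      # number of samples in each cluster
```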
Hierarchical methods produce a graph known as a dendrogram or tree that shows the hierarchical clustering structure. Some hierarchical methods are divisive: they progressively divide the one large cluster that contains all of the samples into pairs of smaller clusters, repeating this process until every cluster has been divided into individual samples. Other hierarchical methods are agglomerative and do the opposite: they start with individual samples and form a cluster of the most similar samples, progressively joining samples and clusters until all samples have been joined into a single large cluster.
Hierarchical methods are useful in that they are not limited to a pre-determined number of clusters and because they can display the similarity of samples across a wide range of scales. Agglomerative hierarchical methods are common in the natural sciences.
The data matrix for cluster analysis needs to be in standard form, with n rows of samples and p columns of variables, called an n x p matrix. Hierarchical agglomerative cluster analysis begins by calculating a matrix of distances among all pairs of samples. This is known as a Q-mode analysis; it is also possible to run an R-mode analysis, which calculates distances (similarities) among all pairs of variables. In a Q-mode analysis, the distance matrix is a square, symmetric matrix of size n x n that expresses all possible pairwise distances among samples. In an R-mode analysis, the matrix has size p x p and expresses all possible pairwise distances (or similarities) among the variables. The remainder of these notes will focus on Q-mode analysis, that is, clustering of samples.
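The distinction between Q-mode and R-mode can be sketched with base R's dist() function on a small random matrix; transposing the data with t() switches from distances among samples to distances among variables.

```r
# Sketch: Q-mode vs. R-mode distance matrices from an n x p data matrix.
# The data are random, purely for illustration.
set.seed(1)
x <- matrix(rnorm(5 * 3), nrow=5, ncol=3)  # n=5 samples, p=3 variables
d.q <- dist(x)        # Q-mode: distances among the n samples
d.r <- dist(t(x))     # R-mode: distances among the p variables
dim(as.matrix(d.q))   # 5 x 5
dim(as.matrix(d.r))   # 3 x 3
```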
Before clustering has begun, each sample is considered a cluster, albeit a cluster containing a single sample. Clustering begins by finding the two clusters that are most similar, based on the distance matrix, and merging them into a new, larger cluster. The characteristics of this new cluster are based on a combination of all the samples in that cluster. This procedure of combining two clusters and recalculating the characteristics of the new cluster is repeated until all samples have been joined into a single large cluster.
A variety of distance metrics can be used to calculate similarity, and different types of data require different similarity coefficients. For data that have linear relationships, Euclidean distance is a useful and common distance measure. For data that show modal relationships, such as ecological data, the Sørenson distance (also called Bray or Bray-Curtis) is a better descriptor of similarity because it is based only on taxa that occur in at least one sample, and it ignores the often large numbers of taxa that occur in neither sample.
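The difference between the two metrics can be seen on toy count data: two samples sharing taxa are close under Bray-Curtis regardless of how many other taxa are absent from both. This sketch assumes the vegan package is installed; vegdist() is its distance function.

```r
# Sketch: Euclidean vs. Sørenson (Bray-Curtis) distance on toy count data.
library(vegan)
counts <- rbind(a=c(10, 5, 0, 0),
                b=c(8, 6, 0, 0),
                c=c(0, 0, 12, 3))
dist(counts)                    # Euclidean distance
vegdist(counts, method="bray")  # Bray-Curtis: shared absences ignored
```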
The order in which clusters are joined is controlled by the linkage methods, and a variety of linkage methods are available. The nearest-neighbor or single-linkage method is based on the elements of two clusters that are most similar, whereas the farthest-neighbor or complete-linkage method is based on the elements that are most dissimilar. Both of these methods are therefore based on the outliers within clusters, which is often not desirable. The median, group average, and centroid methods all emphasize the central tendency of clusters and are less sensitive to outliers. Group average may be unweighted (also known as UPGMA, or unweighted pair-group method with arithmetic averaging) or weighted (WPGMA, or weighted pair-group method with arithmetic averaging). Ward’s method joins clusters based on minimizing the within-group sum of squares, and it tends to produce compact well-defined clusters.
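Several of these linkage methods are available in base R's hclust() function, which makes it easy to compare their behavior on the same distance matrix; the data below are random and purely illustrative. Note that hclust() names UPGMA "average" and Ward's method "ward.D2".

```r
# Sketch: comparing linkage methods with base R's hclust().
set.seed(1)
d <- dist(matrix(rnorm(20 * 4), nrow=20))
hc.single   <- hclust(d, method="single")    # nearest neighbor
hc.complete <- hclust(d, method="complete")  # farthest neighbor
hc.average  <- hclust(d, method="average")   # UPGMA
hc.ward     <- hclust(d, method="ward.D2")   # Ward's method
par(mfrow=c(1, 2))
plot(hc.single, main="Single linkage")
plot(hc.ward, main="Ward's method")
```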
The results of the cluster analysis are shown by a dendrogram, which lists all of the samples and indicates at what level of similarity any two clusters were joined. The x-axis is the measure of similarity or distance at which clusters join, and different programs use different measures on this axis.
In the dendrogram shown below, samples 1 and 2 are the most similar and join to form the first cluster (with a similarity or height of 5), followed by samples 3 and 4. The last two clusters to form are 1-2-3-4 and 5-9-6-7-8-10. In some cases, clusters grow by the joining of two smaller clusters, such as the joining of 1-2 and 3-4. In other cases, clusters grow by the addition of single samples, such as the join of 6 with 5-9, followed by the join of 7. The growth of a cluster by the repeated addition of single samples is called chaining.
A common goal is to determine the number of groups that are present in the data, and it should be clear from the hierarchical nature of a dendrogram that the number of clusters depends on the desired level of similarity. Often, one searches for natural groupings that are defined by long stems, such as the long stem extending to the right of cluster 1-2-3-4. Some have suggested that all clusters should be defined at a consistent level of similarity, such that one would draw a line across the dendrogram at a chosen level of similarity, and all stems that intersect that line would indicate a cluster. The strength of clustering is indicated by the level of similarity at which elements join a cluster. In the example above, elements 1-2-3-4 join at similar levels, as do elements 5-9-6-7-8-10, suggesting the presence of two major clusters in this analysis.
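Drawing a line across a dendrogram at a chosen level of similarity can be sketched in base R with abline() and rect.hclust(); the data and cutting height here are arbitrary, chosen only for illustration.

```r
# Sketch: cutting a dendrogram at a chosen level of similarity.
set.seed(1)
hc <- hclust(dist(matrix(rnorm(15 * 3), nrow=15)))
plot(hc)
abline(h=2, col="red", lty=2)  # stems crossing this line define clusters
rect.hclust(hc, h=2)           # box the clusters defined at that height
```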
The dendrogram shown below displays more chaining, clustering at a wider variety of levels, and a lack of long stems, suggesting that there are no well-defined clusters within the data. Defining groups reflects a tradeoff between the number of groups and the similarity of elements in the group. If many groups are defined, each will have few samples, and those samples will be highly similar to one another, but analyzing so many groups may be complicated. If only a few groups are defined, the clusters will likely have many samples, with a lower level of similarity, but having fewer groups may make subsequent analysis more tractable.
Dendrograms are like mobiles: they can be freely rotated around any stem. For example, the dendrogram shown above could be spun to place the samples in a different order, such as shown below. Spinning a dendrogram is a useful way to accentuate patterns of chaining or the distinctiveness of clusters (although it does not help in this case). For many cluster analysis programs, spinning must be done separately and rather tediously in a graphics program, such as Illustrator. Spinning can be done much more directly in R.
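Spinning in R can be sketched with base R's dendrogram methods: rev() flips the whole tree, and reorder() rotates stems toward a set of weights. The data below are random, purely for illustration.

```r
# Sketch: spinning (rotating) a dendrogram in base R.
set.seed(1)
hc <- hclust(dist(matrix(rnorm(10 * 3), nrow=10)))
dend <- as.dendrogram(hc)
plot(dend)                     # original order
plot(rev(dend))                # all stems flipped
plot(reorder(dend, wts=10:1))  # stems rotated toward the given weights
```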
Because of the agglomerative nature of these methods, clusters are sensitive to the order in which samples join, and the resulting clusters may not reflect the actual groups in the data very well.
Cluster analysis is also sensitive to the chosen distance metric and linkage method. Because of this, novices may be tempted to try many combinations of distance metric and linkage method in a hope that the right combination will emerge. Instead, such a shotgun approach usually results in a bewildering array of conflicting results. It is almost always better to choose a distance metric and linkage method based on a careful consideration of the data and the goals of the analysis.
Caution should be used when defining groups based on cluster analysis, particularly if long stems are not present. Even if the data form a cloud in multivariate space, cluster analysis will still form clusters, and they may not represent natural or meaningful groups. It is generally wise to compare a cluster analysis to an ordination to evaluate the distinctness of the groups in multivariate space.
Transformations may be needed to put samples and variables on comparable scales; otherwise, clustering may reflect sample size or be dominated by variables with large values. For example, ecological count data are commonly percent-transformed within rows to remove the effect of sample size. For ecological data and other data, it is common to perform a percent-maximum transformation within columns to prevent a single large variable from overwhelming important variations in numerically small variables.
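Both transformations can be sketched with base R's sweep(); the decostand() function in vegan does the same thing more conveniently, as shown later in these notes. The counts below are toy values for illustration.

```r
# Sketch: percent transformation on rows, then percent-maximum on columns.
counts <- rbind(s1=c(30, 10, 0),
                s2=c(5, 3, 2),
                s3=c(0, 20, 80))
pct <- sweep(counts, 1, rowSums(counts), "/")     # percent within rows
pctmax <- sweep(pct, 2, apply(pct, 2, max), "/")  # percent of column max
rowSums(pct)           # each row now sums to 1
apply(pctmax, 2, max)  # each column now has a maximum of 1
```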
The cluster package in R includes the many methods presented in Kaufman and Rousseeuw (1990). Curiously, the methods are all named after women, with names derived from the names of the methods themselves. Of the partitioning methods, pam() is based on partitioning around medoids, clara() is for clustering large applications, and fanny() uses fuzzy analysis clustering. Of the hierarchical methods, agnes() uses agglomerative nesting, diana() is based on divisive analysis, and mona() is based on monothetic analysis of binary variables. Other functions include daisy(), which calculates dissimilarity matrices, but is limited to the Euclidean, Manhattan, and Gower distance measures.
This example will show how to apply cluster analysis to ecological data to identify groups of collections that have similar sets of species in similar proportions. The data consist of counts of 38 taxa in 127 samples of marine fossils from the Late Ordovician of the Cincinnati, Ohio region. Notice that most of the values are zero: samples typically contain only a few taxa, although many different taxa are present in the entire data set.
fossils <- read.table(file="c2.txt", header=TRUE, row.names=1, sep=",")
Because these data are based on counts of fossils, and the number of specimens differs greatly among the samples (the smallest sample has 15 specimens and the largest has 295), we want the cluster analysis to reflect differences in the relative proportions of species, not sample size, so all abundances should be converted to percentages. To do this, we perform a percent transformation on rows, sometimes called standardization by totals. In addition, some taxa are very abundant and others are rare. Rather than have the cluster analysis dominated by the most abundant taxa, we follow the percent transformation with a percent-maximum transformation on the columns, which will convert every value to a percent of the maximum abundance for each taxon. In this way, all values will range from zero (taxon is absent) to one (taxon is at its most abundant). The vegan package has a helpful set of transformations in the decostand() function that can perform both transformations easily. The .t1 and .t2 notation indicates transformation 1 and transformation 2.
fossils.t1 <- decostand(fossils, method="total")
fossils.t2 <- decostand(fossils.t1, method="max")
For ecological data, the most appropriate distance measure is the Bray metric, also known as the Sørenson metric. The vegdist() function in the vegan package makes the calculation of this similarity metric easy. For vegdist(), the Bray metric is the default.
fossils.t2.bray <- vegdist(fossils.t2)
The cluster analysis can be run directly on this similarity matrix. For ecological data, Ward’s method often produces nicely demarcated clusters and it is used here. Notice how suffixes are added to our object names (fossils, fossils.t2, fossils.t2.bray, fossils.t2.bray.agnes) to make it clear what each object is. This is a good habit to get into so that analyses are not run on the wrong object.
fossils.t2.bray.agnes <- agnes(fossils.t2.bray, method="ward")
The agnes object includes several components, such as order and order.lab, which are used below.
The dendrogram is displayed by calling plot() on the agnes object, and specifying which.plots=2. Setting cex to a value less than one may be necessary to prevent overlapping labels. Increasing the width of the plot window can also lessen crowding of labels.
plot(fossils.t2.bray.agnes, which.plots = 2, main="Fossil assemblages", cex=0.5)
A line was added to the plot at a height of 2.4, and at this level of similarity, three main clusters are apparent. Each of these clusters could be divided more finely if that aided in the interpretation. For example, the middle of these three clusters has two main subclusters. Analysis of them might show that this is a useful subdivision.
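Cluster memberships at this cutting height can also be extracted programmatically. This sketch converts the agnes object with as.hclust() so that cutree() can be applied; the height of 2.4 follows the line described above.

```r
# Sketch: extracting cluster memberships at the cutting height of 2.4.
groups <- cutree(as.hclust(fossils.t2.bray.agnes), h=2.4)
table(groups)  # number of samples in each cluster

# Alternatively, ask for a fixed number of clusters:
groups3 <- cutree(as.hclust(fossils.t2.bray.agnes), k=3)
```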
It is useful to understand why clusters group together, that is, it is useful to summarize the composition of each cluster. For ecological data, we could add up the abundance of each taxon in that cluster, re-express these totals as percentages, and sort them from the highest to the lowest. The first step in this process is to find the samples that mark the endpoints of each cluster. From our dendrogram, we can see that sample 2D001 is the leftmost sample in cluster 1 and 2D046 is the rightmost. For cluster 2, these endpoints are 2D004 and 2D069, and for cluster 3, they are 2D003 and 2S032. The order component in agnes is a vector that gives the order of our original row numbers (in fossils.t2.bray, which is the same order as fossils.t2 and fossils) on the dendrogram. We need to find the set of original row numbers in each of the clusters, and that can be done with the which() command.
cluster1 <- fossils.t2.bray.agnes$order[which(fossils.t2.bray.agnes$order.lab=="2D001") : which(fossils.t2.bray.agnes$order.lab=="2D046")]
cluster2 <- fossils.t2.bray.agnes$order[which(fossils.t2.bray.agnes$order.lab=="2D004") : which(fossils.t2.bray.agnes$order.lab=="2D069")]
cluster3 <- fossils.t2.bray.agnes$order[which(fossils.t2.bray.agnes$order.lab=="2D003") : which(fossils.t2.bray.agnes$order.lab=="2S032")]
This code looks complicated but what it is finding is the row numbers in our original data that correspond to each of the clusters. From this, it is easy to extract the fossil counts for all the samples in each cluster.
cluster1.fossils <- fossils[cluster1, ]
cluster2.fossils <- fossils[cluster2, ]
cluster3.fossils <- fossils[cluster3, ]
From these collections, we can use colSums() to total the abundances of each taxon, convert these to a percentage, sort them from largest to smallest, then round them off to make them easier to read.
round(sort(colSums(cluster1.fossils) / sum(colSums(cluster1.fossils)), decreasing=TRUE), digits=2)
round(sort(colSums(cluster2.fossils) / sum(colSums(cluster2.fossils)), decreasing=TRUE), digits=2)
round(sort(colSums(cluster3.fossils) / sum(colSums(cluster3.fossils)), decreasing=TRUE), digits=2)
Cluster 1 is dominated by ramose trepostome bryozoans (ramos; 17%), strophomenid brachiopods (strop; 14%), dalmanellid brachiopods (dalma; 14%), and the brachiopod Rafinesquina (Rafin; 14%), a relatively deep-water association. Cluster 2 is dominated by Rafinesquina (Rafin; 29%), the brachiopod Zygospira (Zygos; 25%), and ramose trepostome bryozoans (ramos; 19%), a mid-depth association. Cluster 3 is dominated by ramose trepostome bryozoans (ramos; 42%), the brachiopod Hebertella (Heber; 11%), and the brachiopod Platystrophia ponderosa (Ponde; 8%), a characteristic shallow-water association.
This interpretation is confirmed by the letters in the sample names, where D indicates deep subtidal and S indicates shallow subtidal. All but one of the samples in cluster 1 are from the deep subtidal, about two-thirds of the samples in cluster 2 are from the deep subtidal, and two-thirds of the samples in cluster 3 are from the shallow subtidal.
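These tallies can be checked directly. The sketch below assumes the habitat code (D or S) is the second character of each sample name, as in "2D001" and "2S032".

```r
# Sketch: tallying habitat letters (D = deep, S = shallow) in each cluster.
table(substr(rownames(fossils)[cluster1], 2, 2))
table(substr(rownames(fossils)[cluster2], 2, 2))
table(substr(rownames(fossils)[cluster3], 2, 2))
```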
Cluster analysis can also be applied to non-ecological data to find groups of similar samples. In this example, it is applied to the Purdin geochemistry data, which consists of geochemical measurements on limestone. The geochemical measurements are extracted from the data frame, and a log+1 transformation is applied to the right-skewed major elements.
purdin <- read.table("purdin.txt", header=TRUE, row.names=1, sep=",")
geochem <- purdin[ , 2:10]
geochem[ ,3:9] <- log10(geochem[ , 3:9] + 1)
To weight all the variables equally in the cluster analysis, a percent maximum transformation is applied, using the decostand() function from the vegan library.
geochem.t1 <- decostand(geochem, method="max")
For the cluster analysis, the default Euclidean distance is appropriate, as is the default arithmetic-averaging (UPGMA) linkage method. Note that in this case, agnes() is called on the data rather than on a distance matrix, as it was in the ecological example.
geochem.t1.agnes <- agnes(geochem.t1)
Plotting is done the same as in the ecological example.
plot(geochem.t1.agnes, which.plots = 2, main="Limestone geochemistry", cex=0.5)
This dendrogram shows the presence of several clusters, including a large one in the center of the plot. Two samples at the far right join at a low level of similarity, as does an additional sample just to their left; this suggests the presence of outliers. It is often useful to remove such samples and re-run the cluster analysis.
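Removing outliers and re-running the analysis can be sketched as follows; the sample names here are placeholders, and the actual outliers would be identified from the labels on the dendrogram.

```r
# Sketch: dropping suspected outliers and re-running the cluster analysis.
outliers <- c("sampleA", "sampleB")  # hypothetical names; read off the plot
geochem.t1.trimmed <- geochem.t1[!(rownames(geochem.t1) %in% outliers), ]
geochem.t1.trimmed.agnes <- agnes(geochem.t1.trimmed)
plot(geochem.t1.trimmed.agnes, which.plots=2, cex=0.5)
```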
The composition of these clusters could also be calculated as in the ecological example above, but this is, as they say, an exercise left for the reader.
Kaufman, L., and P.J. Rousseeuw, 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley: New York.
Legendre, P., and L. Legendre, 1998. Numerical Ecology. Elsevier: Amsterdam, 853 p.
McCune, B., and J.B. Grace, 2002. Analysis of Ecological Communities. MjM Software Design: Gleneden Beach, Oregon, 300 p.