2.3.4 Examining the Dendrograms

1. At the start of the hierarchical clustering algorithm, each row is considered a cluster unto itself. Let's call these singleton clusters B, F, H, L, P, and Z to distinguish them from the corresponding points Bob, Flo, Herb, Liz, Pat, and Zeke. At each step thereafter, those clusters which are closest together¹ will be merged.

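Bob and Flo are the closest pair, so all three methods merge them first. The resulting cluster is referred to below as C1, S1, or A1 under Complete, Single, and Average linkage, respectively.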

Select rows Bob and Flo in cluster_types.

Press Show RowSel.

The corresponding points are highlighted in the graphs.

Press Cancel in the Locate prompting dialog.
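The first step can be sketched in a few lines of Python. This is only an illustration: the coordinates below are hypothetical (the tutorial's actual data values are not reproduced here), Euclidean distance is assumed, and the names `points`, `clusters`, and `closest_pair` are invented for this sketch.

```python
import math
from itertools import combinations

# Hypothetical 2-D coordinates (NOT the tutorial's data), chosen so that
# Bob and Flo are the closest pair of points.
points = {
    "Bob": (0.0, 2.0),
    "Flo": (0.0, 1.0),
    "Herb": (0.0, -0.5),
    "Liz": (2.1, -0.5),
    "Pat": (5.0, 3.0),
    "Zeke": (5.0, 5.25),
}

# At the start, every row is a singleton cluster.
clusters = [frozenset([name]) for name in points]


def closest_pair(pts):
    """Return the pair of names with the smallest Euclidean distance."""
    return min(
        combinations(sorted(pts), 2),
        key=lambda pair: math.dist(pts[pair[0]], pts[pair[1]]),
    )


first_merge = closest_pair(points)
print(len(clusters), "singleton clusters; first merge:", first_merge)
```

With singleton clusters every linkage definition reduces to the plain point-to-point distance, which is why the methods cannot differ until the second step.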

2. The methods diverge in the next step, because they have different definitions of how far clusters are from each other. Consider Herb, which is at the bottom center of the Methods graph.

Select the third row, Herb, then press Show RowSel.

Under Complete linkage, the distance from cluster H to cluster C1 is the maximum pairwise distance from Herb to any point in cluster C1, i.e., the distance to Bob. This is greater than the distance to Liz, so the second Complete clustering step creates cluster C2 from Herb and Liz. Herb becomes the third point from the left in the Complete dendrogram, and Liz the fourth.

Figure 8 Complete Clustering

Under Single linkage, the distance from cluster H to cluster S1 is defined as the minimum pairwise distance from Herb to any point in S1, i.e., the distance to Flo. This is smaller than the distance to Liz, so the second clustering step is to add Herb to S1; call the result S2.

Figure 9 Single Clustering

With Average linkage, the distance from cluster H to cluster A1 is defined as the average pairwise distance from Herb to each point in A1. This happens to be slightly less than the distance to Liz, so Average linkage also adds Herb to the first cluster, creating A2 in the second clustering step.

Figure 10 Average Clustering
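The three definitions compared in this step can be written out as a short sketch. The coordinates are hypothetical, picked only so that Herb's distances are ordered as described in the text (Flo < average of Bob and Flo < Liz < Bob); Euclidean distance and the helper name `linkage_distance` are assumptions of this example, not part of the software. The sketch looks only at Herb's two options and ignores other candidate merges.

```python
import math
from statistics import mean

# Hypothetical coordinates (NOT the tutorial's data).
points = {
    "Bob": (0.0, 2.0),
    "Flo": (0.0, 1.0),
    "Herb": (0.0, -0.5),
    "Liz": (2.1, -0.5),
}


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the named linkage method."""
    pairwise = [
        math.dist(points[a], points[b]) for a in cluster_a for b in cluster_b
    ]
    if method == "complete":  # farthest pair
        return max(pairwise)
    if method == "single":  # nearest pair
        return min(pairwise)
    if method == "average":  # mean over all pairs
        return mean(pairwise)
    raise ValueError(f"unknown method: {method}")


first_cluster = {"Bob", "Flo"}
# Between two singletons, all three methods give the same value.
herb_to_liz = linkage_distance({"Herb"}, {"Liz"}, "single")

decisions = {}
for method in ("complete", "single", "average"):
    d = linkage_distance({"Herb"}, first_cluster, method)
    decisions[method] = "joins Bob+Flo" if d < herb_to_liz else "pairs with Liz"
    print(f"{method:8s} Herb to first cluster: {d:.3f}  Herb to Liz: {herb_to_liz:.3f}")
```

Only the reduction over the pairwise distances (maximum, minimum, or mean) changes between methods; the pairwise distances themselves are identical.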

3. The second clustering step for Complete linkage assigned Liz as the fourth point from the left. What is the next clustering level? The distance from Pat to either Flo or Bob is 2.256, whereas the distance between Pat and Zeke is 2.250. Hence clusters P and Z are consolidated to create C3.

4. Under the Single linkage method, the third step (the third horizontal bar moving up from the bottom of the dendrogram) entails adding another point to S2. Which point is it?

MSS: Pick Points

Click on the fourth point from the left-hand side of the Single dendrogram shown in D4.

In the message area you see:

    
    Pick:
    ROW 4 LIZ

5. The progression for Average clustering is less intuitive. Here, the third clustering step (the one after A2 was formed) paired Liz with an element not in A2. Which one?

Click on the rightmost point at the bottom of the Average dendrogram shown in D3.

In the message area you see:

    
    Pick:
    ROW 6 ZEKE

Click on End Select to terminate point picking.
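The merge sequences traced in the steps above come from repeating one operation: find the two closest clusters under the chosen linkage and merge them. A minimal sketch of that loop follows. It assumes Euclidean distance and uses hypothetical coordinates constructed so that the first two merges follow the text; later merges on this toy data need not match the tutorial's dendrograms, and `POINTS`, `cluster_distance`, and `merge_order` are names invented for the example.

```python
import math
from itertools import combinations

# Hypothetical coordinates (NOT the tutorial's data).
POINTS = {
    "Bob": (0.0, 2.0),
    "Flo": (0.0, 1.0),
    "Herb": (0.0, -0.5),
    "Liz": (2.1, -0.5),
    "Pat": (5.0, 3.0),
    "Zeke": (5.0, 5.25),
}


def cluster_distance(a, b, method):
    """Distance between clusters a and b under the named linkage."""
    d = [math.dist(POINTS[p], POINTS[q]) for p in a for q in b]
    if method == "complete":
        return max(d)
    if method == "single":
        return min(d)
    if method == "average":
        return sum(d) / len(d)
    raise ValueError(f"unknown method: {method}")


def merge_order(method):
    """Return the merges in order, each as a pair of sorted member lists."""
    clusters = [frozenset([name]) for name in POINTS]  # singletons
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters that are closest under this linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], method),
        )
        merges.append((sorted(a), sorted(b)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


for method in ("complete", "single", "average"):
    print(method)
    for step, (left, right) in enumerate(merge_order(method), start=1):
        print(f"  step {step}: {left} + {right}")
```

Each merge corresponds to one horizontal bar in a dendrogram, counted from the bottom up. Six points always yield five merges, whichever linkage is used.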

¹ This discussion is cast in terms of distance for ease of exposition. In fact, the hierarchical clustering algorithm uses similarity to determine which agglomeration to perform next. Similarity is inversely related to distance where distance is well defined but, unlike distance, it can be used where the triangle inequality may not apply, e.g., for fingerprint descriptors.