Scope of Hierarchical Cluster Analysis

4.5.6.1 Scope of Hierarchical Cluster Analysis

As in all multivariate analyses, the most important success factor in hierarchical clustering is the relevance of the column property values. For hierarchical clustering in particular, the properties ought to be similar in kind and in numerical scale. Properties whose values span the widest numerical range contribute most to the distances between rows and so tend to dominate the course of the analysis.
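The scale effect can be seen with a little arithmetic. The sketch below assumes an unweighted Euclidean distance, which this paragraph does not state, and uses two rows from the example table given later in this section. The helper names `column_shares` and `autoscale` are illustrative only, and autoscaling is shown as one common remedy, not as something the program is known to apply.

```python
import statistics

# (ActCo, LogP, MR) values copied from the example table in this section.
methane = (-1.43, 0.80, 6.58)
cyclohexane = (-0.90, 3.44, 27.71)

def column_shares(row_a, row_b):
    """Fraction of the squared Euclidean distance contributed by each column."""
    squares = [(a - b) ** 2 for a, b in zip(row_a, row_b)]
    total = sum(squares)
    return [s / total for s in squares]

def autoscale(column):
    """Centre a column on its mean and divide by its sample standard deviation."""
    mean = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [(v - mean) / sigma for v in column]

shares = column_shares(methane, cyclohexane)
for name, share in zip(("ActCo", "LogP", "MR"), shares):
    print(f"{name}: {share:.1%} of the squared distance")
```

On the raw values, MR accounts for well over 90% of the squared distance between these two rows, simply because its numerical range is the largest. Passing each column through `autoscale` before computing distances would put the three properties on a comparable footing.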

Hierarchical clustering makes heavy demands on computation. Both memory and processing time increase as the square of the number of rows in the original table. However, the effect of additional columns is much less severe.
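The quadratic growth follows from the number of row pairs whose distances must be considered. A rough sketch, assuming one stored distance per pair of rows and 8 bytes per value (both are assumptions; the program's actual storage scheme is not described here):

```python
def pair_count(n_rows):
    """Number of distinct row pairs, i.e. entries in a triangular distance matrix."""
    return n_rows * (n_rows - 1) // 2

def distance_matrix_bytes(n_rows, bytes_per_value=8):
    """Rough storage for the pairwise distances (8-byte values are an assumption)."""
    return pair_count(n_rows) * bytes_per_value

for n in (10, 1_000, 100_000):
    print(n, pair_count(n), f"{distance_matrix_bytes(n) / 1e6:.3f} MB")
```

Doubling the number of rows roughly quadruples the pair count, whereas adding a column only lengthens each individual distance calculation.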

A small example of the use of hierarchical clustering is given below. The following portion of a larger table was submitted for cluster analysis:

 
        TITLE:  bcdef 
                                   1         2         3 
                                  ActCo      LogP        MR 
                               --------- --------- --------- 
         1 Methane                -1.43      0.80      6.58 
         2 Ethane                 -1.31      1.81     11.27 
         3 Propane                -1.46      2.36     15.96 
         4 n-Butane               -1.58      2.89     20.63 
         5 2-Methylpropane        -1.68      2.76     20.85 
         6 n-Pentane              -1.71      3.23     25.27 
         7 2,2-DimethylPropane    -1.95      3.11     25.73 
         8 2,2-DimethylButane     -1.90      3.82     29.94 
         9 Cyclopentane           -0.88      3.00     23.14 
        10 Cyclohexane            -0.90      3.44     27.71

The hierarchical clustering analysis was invoked with the following standard settings. At each invocation these tailor settings may be changed.

 
        Subject:   TAILOR!HIER 
         
        Option Name         Description                             Value 
        -----------         -----------                             ----- 
         
        CLUSTERING_METHOD   Method for determining cluster centers  SINGLE 
        INPUT_FORM          Whether to compute distances            NORMAL_TABLE 
        MOD_ANGLES          Whether to assume periodic variables    NO 
         
        Dendrogram was calculated. 
        Name for analysis> sample

Dendrogram (Figure 25): MSS: QSAR >>> Graph QSAR, or QSAR ANALYSIS GRAPH

Listings: MSS: QSAR >>> Report QSAR, or QSAR ANALYSIS LIST

 
        Cluster Node Membership 
         
         #                  Methane  Ethane  Propane  n-Butane  2-Methylpropane 
         -                  -------  ------  -------  --------  --------------- 
         1 Cluster Level 1        1       2        3         4                5 
         2 Cluster Level 2        1       2        3         4                4 
         3 Cluster Level 3        1       2        3         4                4 
         4 Cluster Level 4        1       2        3         4                4 
         5 Cluster Level 5        1       2        3         4                4 
         6 Cluster Level 6        1       2        3         4                4 
         7 Cluster Level 7        1       2        3         4                4 
         8 Cluster Level 8        1       2        3         3                3 
         9 Cluster Level 9        1       2        2         2                2 
        10 Cluster Level 10       1       1        1         1                1 
         #                  n-Pentane 2,2-DimethylPropane 2,2-DimethylButane 
         -                  --------- ------------------- ------------------ 
         1 Cluster Level 1          6                   7                  8 
         2 Cluster Level 2          6                   7                  8 
         3 Cluster Level 3          6                   6                  8 
         4 Cluster Level 4          6                   6                  8 
         5 Cluster Level 5          6                   6                  8 
         6 Cluster Level 6          4                   4                  8 
         7 Cluster Level 7          4                   4                  4 
         8 Cluster Level 8          3                   3                  3 
         9 Cluster Level 9          2                   2                  2 
        10 Cluster Level 10         1                   1                  1 
         
         #                  Cyclopentane Cyclohexane 
         -                  ------------ ----------- 
         1 Cluster Level 1             9          10 
         2 Cluster Level 2             9          10 
         3 Cluster Level 3             9          10 
         4 Cluster Level 4             9           6 
         5 Cluster Level 5             6           6 
         6 Cluster Level 6             4           4 
         7 Cluster Level 7             4           4 
         8 Cluster Level 8             3           3 
         9 Cluster Level 9             2           2 
        10 Cluster Level 10            1           1

As always, the initial cluster level has each row alone in a cluster, and the final level has all rows combined into a single cluster. Reading horizontally, we see that the first clustering (at Cluster Level 2) joins rows 4 and 5, since 2-Methylpropane takes on the cluster number of n-Butane. Reading vertically, we see that ethane joins another cluster at the next-to-last level (Cluster Level 9) and methane survives alone until the end.

 
        Cluster Nodes Above Individual Rows 
         
         #            Left-son Clusters Right-son Clusters 
         -            ----------------- ------------------ 
         1 Node 10+1                  4                  5 
         2 Node 10+2                  6                  7 
         3 Node 10+3                 12                 10 
         4 Node 10+4                 13                  9 
         5 Node 10+5                 11                 14 
         6 Node 10+6                 15                  8 
         7 Node 10+7                  3                 16 
         8 Node 10+8                  2                 17 
         9 Node 10+9                  1                 18 
        10 Node 10+10                19                  0 
         
        Connection levels 
         
        Level11   Level12   Level13   Level14   Level15   Level16   Level17 
        -------   -------   -------   -------   -------   -------   ------- 
          0.274     0.533     2.265     2.298     2.438     2.473     4.702 
        Level18   Level19   Level110 
        -------   -------   -------- 
          4.725    4.799      5.251

These two listings indicate the cluster formations as they are graphed in the dendrogram. The first clustering of rows 4 and 5 (Node 11) occurs at a fairly low connection level (0.274), with rows 6 and 7 joining next (Node 12, at 0.533). The third clustering (Node 13) joins Node 12 with row 10, i.e. the cluster of rows 6 and 7 is joined to row 10, at a connection level of 2.265. At almost the same connection level (2.298; the vertical level on the dendrogram) we see the next cluster forming between the one just formed (Node 13) and row 9.
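The relationship between the node listing and the membership listing can be checked mechanically. The sketch below rebuilds the membership rows from the left-son and right-son pairs. It assumes that a cluster is labelled with the lowest row number it contains, which is an inference from the listing and not a documented rule, and the function names are illustrative.

```python
N_ROWS = 10

# (left son, right son) for nodes 11..19, copied from the listing above.
# Node 20 (sons 19 and 0) is omitted because it adds no further rows.
SONS = [(4, 5), (6, 7), (12, 10), (13, 9), (11, 14),
        (15, 8), (3, 16), (2, 17), (1, 18)]

def node_members(sons, n_rows):
    """Map every node number to the set of original rows beneath it."""
    members = {row: {row} for row in range(1, n_rows + 1)}
    for offset, (left, right) in enumerate(sons, start=1):
        members[n_rows + offset] = members[left] | members[right]
    return members

def membership_at_level(level, sons, n_rows):
    """Cluster label of each row at a cluster level (level 1 = all rows alone).

    Labelling a cluster by its lowest row number is an assumption that
    happens to match the membership listing.
    """
    members = node_members(sons, n_rows)
    labels = list(range(1, n_rows + 1))
    # Cluster level k reflects the first k - 1 merges.
    for offset in range(1, level):
        cluster = members[n_rows + offset]
        for row in cluster:
            labels[row - 1] = min(cluster)
    return labels

for level in range(1, N_ROWS + 1):
    print(f"Cluster Level {level:2d}", membership_at_level(level, SONS, N_ROWS))
```

Each printed line corresponds to one row of the Cluster Node Membership listing, with the ten labels given in table row order (Methane through Cyclohexane).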

 
        Setup  : HIERARCHICAL CLUSTERING Analysis 
          Minimum Sigma to use Column:  0.0000   Missing Values: DROP_ROW 
          Clustering Method: SINGLE_LINKAGE   Input Form: NORMAL 
          No distances were periodic. 
          Columns 
           1: ActCo                 2: LogP                  3: MR 
          Rows (10 rows) 
           1: Methane               2: Ethane                3: Propane 
           4: n-Butane              5: 2-Methylpropane       6: n-Pentane 
           7: 2,2-DimethylPropane   8: 2,2-DimethylButane    9: Cyclopentane 
          10: Cyclohexane

The exact description of the analysis is also available in the listings.
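The settings in this listing (single linkage, normal table input, no periodic distances) can be checked against the connection levels reported earlier. The following is a minimal sketch, not the program's own algorithm: it assumes plain Euclidean distances on the unscaled column values, which the listing does not state, and uses a naive search that is adequate only for small tables. The name `single_linkage` is illustrative.

```python
import math

# (ActCo, LogP, MR) for rows 1..10, copied from the example table.
ROWS = [
    (-1.43, 0.80, 6.58), (-1.31, 1.81, 11.27), (-1.46, 2.36, 15.96),
    (-1.58, 2.89, 20.63), (-1.68, 2.76, 20.85), (-1.71, 3.23, 25.27),
    (-1.95, 3.11, 25.73), (-1.90, 3.82, 29.94), (-0.88, 3.00, 23.14),
    (-0.90, 3.44, 27.71),
]

def single_linkage(rows):
    """Naive single-linkage clustering; returns [(level, merged rows), ...]."""
    clusters = [[i] for i in range(len(rows))]  # each row starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        # Remove the higher index first so the lower one stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
        merges.append((d, [i + 1 for i in merged]))  # report 1-based rows
    return merges

merges = single_linkage(ROWS)
for level, members in merges:
    print(f"{level:6.3f}  rows {members}")
```

Under these assumptions the nine merge levels agree, to within rounding, with the first nine connection levels in the listing (0.274 through 4.799), and the merges occur in the same order as Nodes 11 to 19. The tenth listed value (5.251) belongs to the extra top node and is not reproduced by this sketch.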

Figure 25 Dendrogram of hierarchical clustering example.

Pick Points

TABLE EXAMINE

Warning:

The analyses are still valid and can be worked with through analysis listing as described above. Such analyses can also be output as cluster columns in the Advanced CoMFA module, but cannot be graphed as dendrograms.

Dendrograms can be used in at least two ways: to see which compounds are most similar in terms of the distance metric chosen, and to select a subset of compounds which are representative of the larger data set.

Here the most similar compounds cluster together at the lowest levels. Thus the most similar pairs are rows 4 and 5, and rows 6 and 7. The compounds in rows 1, 2 and 3 are somewhat similar to each other but quite different from the others. This is mainly due to the strong differences in MR for those rows as seen in the table listing above.

If the goal were to select 3 of the 10 compounds for more detailed study, it would be reasonable to select one compound from each of the top three levels. The top level separates row 1 from rows 2-10, so row 1 would be selected. The second level separates row 2 from rows 3-10, so row 2 would be selected. One compound from the remaining set (rows 3-10) would also be needed, perhaps row 10.
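Choosing three representatives amounts to cutting the hierarchy where exactly three clusters remain, which in the membership listing is Cluster Level 8. A small sketch that groups the rows at that level is given below. The labels are copied from the listing, the helper name `clusters_from_labels` is illustrative, and the choice of a representative within each group is left to the analyst.

```python
from collections import defaultdict

ROW_NAMES = ["Methane", "Ethane", "Propane", "n-Butane", "2-Methylpropane",
             "n-Pentane", "2,2-DimethylPropane", "2,2-DimethylButane",
             "Cyclopentane", "Cyclohexane"]

# Cluster Level 8 from the membership listing: the level with three clusters.
LEVEL_8 = [1, 2, 3, 3, 3, 3, 3, 3, 3, 3]

def clusters_from_labels(labels):
    """Group 1-based row numbers by their cluster label."""
    groups = defaultdict(list)
    for row, label in enumerate(labels, start=1):
        groups[label].append(row)
    return dict(groups)

groups = clusters_from_labels(LEVEL_8)
for label, rows in groups.items():
    names = ", ".join(ROW_NAMES[r - 1] for r in rows)
    print(f"cluster {label}: choose one of rows {rows} ({names})")
```

Two of the three groups contain a single row, so only the third group leaves a real choice.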

The usefulness of the clustering is dependent on careful choice of the columns used, which determines the distances on which clustering operates.

Cluster columns are another powerful tool for working with cluster analyses. In the simplest case, such a column simply indicates which cluster each row belongs to at a specified level of clustering. The numbering of the clusters in this case, however, does not in general say anything about the relationship between clusters.

More information from a dendrogram can be captured in a cluster column by coordinating the numbering of clusters across several levels, so that low-level clusters which are related at higher levels bear cluster ID numbers which are close to one another. This happens to be the case in the listings given above: clusters 4-10 at cluster level 2 are all included in cluster 4 at cluster level 7. This introduces a partial order into the clustering which is useful for many graphing applications.
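How low-level cluster IDs nest inside higher-level ones can be tabulated directly from two rows of the membership listing. The sketch below uses Cluster Level 2 and Cluster Level 7 as an example pair. The label lists are copied from the listing and the function name `parent_clusters` is illustrative.

```python
# Two rows of the membership listing, one label per table row 1..10.
LEVEL_2 = [1, 2, 3, 4, 4, 6, 7, 8, 9, 10]
LEVEL_7 = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4]

def parent_clusters(low_labels, high_labels):
    """Map each low-level cluster ID to the higher-level cluster(s) holding it."""
    parents = {}
    for low, high in zip(low_labels, high_labels):
        parents.setdefault(low, set()).add(high)
    return parents

parents = parent_clusters(LEVEL_2, LEVEL_7)
for low in sorted(parents):
    print(f"level-2 cluster {low} -> level-7 cluster(s) {sorted(parents[low])}")
```

In a true hierarchy every low-level cluster maps to exactly one parent, so each set in `parents` holds a single ID. Other pairs of levels can be compared by substituting the corresponding rows of the listing.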