Contents

Hierarchical Clustering

Cluster Analysis 

Cluster Analysis, also called data segmentation, has a variety of goals. All relate to grouping or segmenting a collection of objects (also called observations, individuals, cases, or data rows) into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. Central to all of the goals of cluster analysis is the notion of degree of similarity (or dissimilarity) between the individual objects being clustered. There are two major methods of clustering -- hierarchical clustering and k-means clustering.

In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object.  Hierarchical Clustering is subdivided into agglomerative methods, which proceed by series of fusions of the n objects into groups, and divisive methods, which separate n objects successively into finer groupings. Agglomerative techniques are more commonly used, and this is the method implemented in XLMiner™. Hierarchical clustering may be represented by a two dimensional diagram known as dendrogram which illustrates the fusions or divisions made at each successive stage of analysis. An example of such a dendrogram is given below:

Agglomerative methods

An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1, ....... , P1. The first Pn consists of n single object 'clusters', the last P1, consists of single group containing all n cases. 

At each particular stage the method joins together the two clusters which are closest together (most similar).  (At the first stage, of course, this amounts to joining together the two objects that are closest together, since at the initial stage each cluster has one object.)   

Differences between methods arise because of the different ways of defining distance (or similarity) between clusters. Several agglomerative techniques will now be described in detail.

Single linkage clustering

One of the simplest agglomerative hierarchical clustering method is single linkage, also known as the nearest neighbor technique. The defining feature of the method is that distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. 

In the single linkage method, D(r,s) is computed as

D(r,s) = Min { d(i,j) : Where object i is in cluster r and object j is cluster s }

 
Here the distance between every possible object pair (i,j) is computed, where object i is in cluster r and object j is in cluster s. The minimum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. 

At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is minimum, are merged.

This measure of inter-group distance is illustrated in the figure below:

Complete linkage clustering

The complete linkage, also called farthest neighbor, clustering method is the opposite of single linkage.  Distance between groups is now defined as the distance between the most distant pair of objects, one from each group. 

In the complete linkage method, D(r,s) is computed as

D(r,s) = Max { d(i,j) : Where object i is in cluster r and object j is cluster s }

Here the distance between every possible object pair (i,j) is computed, where object i is in cluster r and object j is in cluster s and the maximum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the longest link between the clusters. 

At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is minimum, are merged.

The measure is illustrated in the figure below:

 

Average linkage clustering

Here the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group. 

In the average linkage method, D(r,s) is computed as

D(r,s) = Trs / ( Nr * Ns)
Where Trs is the sum of all pairwise distances between cluster r and cluster s. Nr and Ns are the sizes of the clusters r and s respectively.  

At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is the minimum, are merged.

The figure below illustrates average linkage clustering:

 

Average group linkage

With this method, groups once formed are represented by their mean values for each variable, that is, their mean vector, and inter-group distance is now defined in terms of distance between two such mean vectors.

In the average group linkage method, the two clusters r and s are merged such that, after merger, the average pairwise distance within the newly formed cluster, is minimum. Suppose we label the new cluster formed by merging clusters r and s, as t. Then D(r,s) , the distance between clusters r and s is computed as

D(r,s) = Average { d(i,j) : Where observations i and j are in cluster t, the cluster formed by merging clusters r and s }

 
At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is minimum, are merged. In this case, those two clusters are merged such that the newly formed cluster, on average, will have minimum pairwise distances between the points in it.

Ward's  hierarchical clustering method

Ward (1963) proposed a clustering procedure seeking to form the partitions Pn, P n-1,........,  P in a manner that minimizes the loss associated with each grouping, and to quantify that loss in a form that is readily interpretable. At each step in the analysis, the union of every possible cluster pair is considered and the two clusters whose fusion results in minimum increase in 'information loss' are combined. Information loss is defined by Ward in terms of an error sum-of-squares criterion, ESS.

The rationale behind Ward's proposal can be illustrated most simply by considering univariate data. Suppose for example, 10 objects have scores (2, 6, 5, 6, 2, 2, 2, 2, 0, 0, 0) on some particular variable. The loss of information that would result from treating the ten scores as one group with a mean of 2.5 is represented by ESS given by,

ESS One group  = (2 -2.5) + (6 -2.5) + ....... + (0 -2.5) = 50.5 

On the other hand, if the 10 objects are classified according to their scores into four sets,

{0,0,0}, {2,2,2,2}, {5}, {6,6}

The ESS can be evaluated as the sum of squares of four separate error sums of squares

ESS One group = ESS group1  + ESSgroup2 + ESSgroup3 + ESSgroup4  = 0.0

Thus, clustering the 10 scores into 4 clusters results in no loss of information.

See also