
Hierarchical Clustering

Example:

Data Size: Different versions of XLMiner™ have different limits on the size of data they can handle. The size of the data in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

The figure below gives corporate data on 22 US public utilities. We are interested in forming groups of similar utilities. The objects to be clustered are the utilities. There are 8 measurements on each utility, as described below.  

An example where clustering would be useful is a study to predict the cost impact of deregulation. To do the requisite analysis, economists would need to build a detailed cost model of the various utilities. It would save a considerable amount of time and effort if we could cluster similar types of utilities, build detailed cost models for just one "typical" utility in each cluster, then scale up from these models to estimate results for all utilities.

Before we can use any clustering technique, we need to define a measure of distance between utilities, so that similar utilities are a short distance apart and dissimilar ones are far from each other. For variables that take continuous values, a popular approach is to standardize the values by dividing by the standard deviation (other scale measures, such as the range, are sometimes used) and then compute the distance between objects using the Euclidean metric.
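
The standardize-then-measure approach described above can be sketched in a few lines of Python. This is an illustration only, not XLMiner's implementation: the function names and the four sample rows are made up, and the sample (n - 1) standard deviation is an assumption, since the text does not say which form is used. The mean is also subtracted here; that shifts every row by the same amount and so leaves the distances unchanged.

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Rescale each column to zero mean and unit sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # assumption: sample (n - 1) standard deviation
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows):
    """Matrix of pairwise Euclidean distances between rows."""
    return [[euclidean(a, b) for b in rows] for a in rows]

# Hypothetical data: 4 objects, 2 variables on very different scales.
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 4000.0]]
z = standardize(rows)
d = distance_matrix(z)
```

Without the standardization step, the second column (in the thousands) would dominate every distance; after it, both columns contribute on the same scale.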

Ex. 1 

  1. Open the file Utilities.xls. The figure below shows the dataset, and explanations of the variables.

  2. In the XLMiner™ menu, select Data Reduction and Exploration and click the Hierarchical Clustering option. In the Hierarchical Clustering dialog box, select variables x1 to x8 and click the Next button. The figure below shows the first dialog box of Hierarchical Clustering with this selection; the options are explained below.

    Data Range: Specify the range containing the data to be clustered. You can either type the address directly in this text box, or use the mouse to point to the desired range. If the active cell is already somewhere in the data range when you invoke this dialog, XLMiner™ detects the data range automatically.

    Data Type : Hierarchical clustering can be used on Raw data (like the Utilities dataset above) or on data in the distance matrix format (explained in Ex. 2). Choose Raw data here.

    Variable Names in the First Row: When this box is checked, XLMiner™ picks up variable names from the headers in the first row of the selected data range. When the box is unchecked, XLMiner™ follows the default naming convention, i.e., the variable in the first column of the selected range is called "Var1," the variable in the second column "Var2," etc.

    Variables: This list box displays all the available variables in the data range. If one of the columns of the selected data range contains all non-numeric entries then XLMiner™ ignores that column and it is not displayed in the list of available variables.

    Selected Variables:  From the list of all available variables, select those to be used in the clustering process. To perform this selection, first click on a variable name in the Variables list box and then click on the transfer button (the button with ">" or "<" label on it). This brings the variable from the Variables list box to the Selected Variables list box. To remove a variable from the list of selected variables, click on the variable name in the Selected Variables list box and then click on the transfer button. This will put the variable back in the Variables list box.

     

  3. In the second dialog box of Hierarchical Clustering, select the Normalize input data option, select the desired clustering method (see the introduction to this section for details of each method), and click the Next button. The figure below shows the second dialog box of Hierarchical Clustering, with the Average group linkage method selected.

    Normalize input data:  Normalizing the data (subtracting the mean and dividing by the standard deviation) is important to ensure that the distance measure accords equal weight to each variable -- without normalization, the variable with the largest scale will dominate the measure.

    Similarity Measure : The option Euclidean distance is automatically chosen as explained in "Using Hierarchical Clustering". The other options are activated only for binary data.

    Clustering Method : Select the Average group linkage method.

  4. In the last dialog box, check the "Draw dendrogram" and "Show cluster membership" check boxes and type the desired number of clusters (in this case we have chosen 4 clusters to be displayed).  Then click on Finish. 

  5. The output of Hierarchical Clustering is displayed in a separate sheet, and its various sections can be viewed using the Output Navigator.

  6. Clustering Stages:  

    This output details the history of the cluster formation.  Initially, each individual case is considered its own cluster (with just itself as a member), so we start off with # clusters = # cases (22 in the example above). At stage 1, above, clusters (i.e. cases) 10 and 13 were found to be closer together than any other two clusters (i.e. cases), so they are joined together in a cluster called Cluster 10.  So now we have one cluster that has two cases (cases 10 and 13), and 20 other clusters that still have just one case in each.  At stage 2, clusters 7 and 12 are found to be closer together than any other two clusters, so they are joined together into cluster 7.

    This process continues until there is just one cluster.  At various stages of the clustering process, there are different numbers of clusters.  A graph called a dendrogram lets you visualize this:  

    In the above dendrogram, the Sub Cluster IDs are listed along the x-axis (in an order convenient for showing the cluster structure).  The y-axis measures inter-cluster distance.  Consider cases 10 and 13 -- they have an inter-cluster distance of 1.51.  No other cases have a smaller inter-cluster distance, so 10 and 13 are joined in a cluster, indicated by the horizontal line linking them. Next, we see that cases 7 and 12 have the next smallest inter-cluster distance, so they are joined.

    The next smallest inter-cluster distance is between clusters 14 and 19, and so on.

    If we draw a horizontal line through the diagram at any level on the y-axis (the distance measure), the vertical cluster lines it intersects indicate clusters whose members are at least that close to each other.  If we draw a horizontal line at distance = 3.6, for example, we see that there are 4 clusters.  We can see that a case can belong to multiple clusters, depending on where we draw the line (i.e. how close we require the cluster members to be to each other).  Hence, the term "hierarchical."

    For purposes of assigning cases to clusters, we must decide in advance how many clusters we want to end up with.  In this example, we specified a limit of 4.

    If the number of training rows exceeds 30, the dendrogram also displays Cluster Legends.

    Predicted Clusters:  This output shows the assignment of cases to clusters (keep in mind that we specified earlier how many clusters we wanted).
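
The stage-by-stage merging described in step 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not XLMiner's algorithm: `average_linkage` and the sample `points` are hypothetical, and the inter-cluster distance used here is plain average linkage (the mean of all pairwise distances between two clusters), which may differ from XLMiner's Average group linkage (see the introduction to this section for the method definitions). As in the Clustering Stages output, a merged cluster keeps the lower of the two IDs.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns (stages, clusters): stages is a list of (kept_id, absorbed_id,
    distance) tuples, clusters maps each surviving cluster ID to its members.
    """
    # Start with every case in its own cluster.
    clusters = {i: [i] for i in range(len(points))}
    stages = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            dist = sum(euclidean(points[i], points[j]) for i, j in pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, a, b)
        dist, a, b = best
        # The merged cluster keeps the lower ID (a < b by construction).
        clusters[a] = clusters[a] + clusters.pop(b)
        stages.append((a, b, dist))
    return stages, clusters

# Hypothetical, already-standardized cases forming three obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.2), (10.0, 0.0), (10.0, 1.5)]
stages, clusters = average_linkage(points, n_clusters=3)
for stage, (kept, absorbed, dist) in enumerate(stages, start=1):
    print(f"stage {stage}: join {kept} and {absorbed} at distance {dist:.2f}")
```

Stopping at `n_clusters` plays the role of the number of clusters typed into the last dialog box; running the loop down to a single cluster would give the full merge history that a dendrogram displays.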

     

Ex. 2  Distance Matrix : Sometimes the data is already in distance matrix form. This means that every cell Cij in it represents the distance between the ith and jth records of the original data. (When applied to Raw data, Hierarchical clustering converts the data into the distance matrix format before proceeding, so starting from a distance matrix saves it one step.)
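
Because each cell holds the distance between two records, a valid distance matrix is square and symmetric, with zeros on its diagonal. The sketch below checks those properties and then finds the closest pair of records, which is all the first clustering stage needs. The function names and the 4 x 4 matrix are hypothetical and have nothing to do with the contents of DistMatrix.xls.

```python
def check_distance_matrix(c, tol=1e-9):
    """Raise ValueError unless c is square, symmetric, non-negative, zero-diagonal."""
    n = len(c)
    if any(len(row) != n for row in c):
        raise ValueError("distance matrix must be square")
    for i in range(n):
        if abs(c[i][i]) > tol:
            raise ValueError(f"diagonal entry {i} is not zero")
        for j in range(i + 1, n):
            if c[i][j] < 0 or abs(c[i][j] - c[j][i]) > tol:
                raise ValueError(f"entries ({i}, {j}) and ({j}, {i}) are inconsistent")
    return n

def closest_pair(c):
    """Return (distance, i, j) for the two records that are nearest each other."""
    n = check_distance_matrix(c)
    return min((c[i][j], i, j) for i in range(n) for j in range(i + 1, n))

# Hypothetical distances between 4 records (numbers only, no row or column labels).
c = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 5.0, 9.0],
    [6.0, 5.0, 0.0, 4.0],
    [10.0, 9.0, 4.0, 0.0],
]
first_merge = closest_pair(c)  # records 0 and 1, at distance 2.0
```

A check like this also catches a range that accidentally includes a label row or column, since the selection would then no longer be a square block of numbers.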

  1. Open the file DistMatrix.xls. 

  2. Click on XLMiner --> Data Reduction and Exploration --> Hierarchical Clustering.

  3. Select the data range as shown above. The data range should not contain the names of the states; it should start from cell C3. Choose Distance Matrix as the data type. Note that no variable names exist here.

    Click Next.

  4. Select Average group linkage as the clustering method.

  5. Set the number of clusters to 4.

    Check the Draw dendrogram and Show cluster membership check boxes.

  6. Click Finish to get the output.

    The dendrogram appears as follows:
