Hierarchical Clustering
Example:
Data Size: Different versions of XLMiner™ have varying limits on the size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.
The figure below gives corporate data on 22 US public utilities. We are interested in forming groups of similar utilities. The objects to be clustered are the utilities, and there are 8 measurements on each utility, as described below.
An example where clustering would be useful is a study to predict the cost impact of deregulation. To do the requisite analysis, economists would need to build a detailed cost model of the various utilities. It would save a considerable amount of time and effort if we could cluster similar types of utilities, build detailed cost models for just one "typical" utility in each cluster, and then scale up from these models to estimate results for all utilities.
Before we can use any clustering technique, we need to define a measure of distance between utilities so that similar utilities are a short distance apart and dissimilar ones are far from each other. For variables that take on continuous values, a popular approach is to standardize each variable by dividing by its standard deviation (sometimes other measures such as the range are used) and then to compute the distance between objects using the Euclidean metric.
Ex. 1
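The standardize-then-Euclidean distance described above can be sketched in a few lines of Python. This is only an illustration, not XLMiner's implementation: the function names and the toy data are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import math
import statistics

def standardize(rows):
    """Divide every column by its standard deviation.

    Assumption: the sample standard deviation is used.
    """
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [[value / sd for value, sd in zip(row, sds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two objects with equally many measurements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: 3 objects with 2 measurements on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
scaled = standardize(rows)

# Distance between the first two objects after standardization.
d01 = euclidean(scaled[0], scaled[1])
```

Without standardization, the second measurement would dominate every distance simply because its values are larger; after dividing by the standard deviation, both measurements contribute on a comparable scale.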
Clustering Stages: This output details the history of the cluster formation. Initially, each individual case is considered its own cluster (with just itself as a member), so we start with as many clusters as cases (21 in the example above). At stage 1, clusters (i.e., cases) 10 and 13 were found to be closer together than any other two clusters, so they are joined into a cluster called Cluster 10. We now have one cluster with two cases (cases 10 and 13) and 19 other clusters that still have just one case each. At stage 2, clusters 7 and 12 are found to be closer together than any other two clusters, so they are joined into Cluster 7. This process continues until there is just one cluster. At various stages of the clustering process, there are different numbers of clusters. A graph called a dendrogram lets you visualize this:
In the above dendrogram, the Sub Cluster IDs are listed along the x-axis (in an order convenient for showing the cluster structure). The y-axis measures inter-cluster distance. Consider cases 10 and 13: they have an inter-cluster distance of 1.51. No other pair of cases has a smaller inter-cluster distance, so 10 and 13 are joined in a cluster, indicated by the horizontal line linking them. Cases 7 and 12 have the next smallest inter-cluster distance, so they are joined next. The next smallest inter-cluster distance is between clusters 14 and 19, and so on.
If we draw a horizontal line through the diagram at any level on the y-axis (the distance measure), the vertical cluster lines it intersects indicate clusters whose members are at least that close to each other. If we draw a horizontal line at distance = 3.6, for example, we see that there are 4 clusters. A case can therefore belong to different clusters depending on where we draw the line (i.e., how close we require the cluster members to be to each other); hence the term "hierarchical."
For purposes of assigning cases to clusters, we must decide in advance how many clusters we want to end up with. In this example, we specified a limit of 4. If the number of training rows exceeds 30, the dendrogram also displays Cluster Legends.
Predicted Clusters: This output shows the assignment of cases to clusters (keep in mind that we specified earlier how many clusters we wanted).
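The stage-by-stage merging and the final assignment of cases to a chosen number of clusters can be illustrated with a small Python sketch. It is not XLMiner's code. It assumes single linkage (the distance between two clusters is the smallest distance between any of their members), which this page does not specify. Following the stages described above, the merged cluster keeps the lower of the two IDs. The function names and the 4-case distance matrix are hypothetical.

```python
def cluster_stages(dist):
    """Return the merge history as (stage, kept_id, absorbed_id, distance).

    dist is a symmetric matrix of pairwise distances; case IDs are 1-based.
    Assumption: single linkage is used to measure inter-cluster distance.
    """
    clusters = {i + 1: [i + 1] for i in range(len(dist))}
    history = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the pair of clusters with the smallest inter-cluster distance.
        d, a, b = min(
            (min(dist[i - 1][j - 1] for i in clusters[a] for j in clusters[b]), a, b)
            for pos, a in enumerate(ids)
            for b in ids[pos + 1:]
        )
        clusters[a] += clusters.pop(b)  # the merged cluster keeps the lower ID
        history.append((len(history) + 1, a, b, d))
    return history

def predicted_clusters(history, n_cases, n_clusters):
    """Replay merges until n_clusters remain; return {case_id: cluster_id}."""
    owner = {case: case for case in range(1, n_cases + 1)}
    for _, kept, absorbed, _ in history[: n_cases - n_clusters]:
        for case, cluster in owner.items():
            if cluster == absorbed:
                owner[case] = kept
    return owner

# Hypothetical distances between 4 cases.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
history = cluster_stages(dist)
# history == [(1, 1, 2, 1.0), (2, 3, 4, 2.0), (3, 1, 3, 3.0)]
assignment = predicted_clusters(history, n_cases=4, n_clusters=2)
# assignment == {1: 1, 2: 1, 3: 3, 4: 3}
```

Stopping the replay after fewer merges corresponds to drawing the horizontal line lower on the dendrogram, which leaves more and tighter clusters.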
Ex. 2
Distance Matrix: Sometimes the data is already in distance matrix form. This means that every cell Cij of the matrix represents the distance between the ith and jth records in the original data. (When applied to raw data, hierarchical clustering converts the data into distance matrix format anyway before proceeding, so supplying a distance matrix saves it one step.)
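To make the distance matrix form concrete, here is a minimal Python sketch that builds such a matrix from raw records. It assumes Euclidean distance. The function name and the sample records are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def distance_matrix(rows):
    """Build C, where C[i][j] is the Euclidean distance between records i and j."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical raw data: 3 records with 2 measurements each.
records = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
C = distance_matrix(records)
# C[0][1] is 5.0, C[0][2] is 10.0, and every diagonal entry is 0.0
```

The resulting matrix is symmetric with zeros on the diagonal, so only the entries on one side of the diagonal carry information.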