k-Means Clustering

 

Example:

Data Size: Different versions of XLMiner™ have different limits on the size of data they can handle. The data set used in the example below may exceed the limit of your version. Refer to Data Handling Specifications for details.

  1. Open the file Wine.xls in Excel. The figure below shows the data set, where each row represents a sample of wine taken from one of three wineries (A, B and C).  In this problem, the Type variable representing the winery is ignored and the clustering is performed simply on the basis of the properties of the wine samples (the remaining variables).

  2. In XLMiner™, select Data Reduction and Exploration --> k-Means clustering. The following dialog box will appear:

    Data Range: Specifies the range of input data used for partitioning. XLMiner™ automatically picks the active data range. You can also type the range address or select it with the mouse.

    Variables: This box lists all the variables present in the dataset. If the "First row contains headers" box is checked, the header row above the data is used to identify variable names.

    Select all the variables except Type.

  3. Click on Next to advance to the second dialog box. 

    Normalize input Data:  Normalizing the data is important to ensure that the distance measure accords equal weight to each variable -- without normalization, the variable with the largest scale will dominate the measure.

    # Clusters:  Select the number of final clusters to be formed. This is the parameter k in k-means clustering. The number of clusters must be at least 2 and at most the number of observations in the data range.  Set this value based on your best estimate of how many clusters there will be; it is a good idea to repeat the procedure with several different values.

    # Iterations:  This determines how many times the program will start with an initial partition and follow through with the clustering algorithm.  The configuration of clusters (and how well they separate the data) may differ from one starting partition to another. The program goes through the specified number of iterations and selects the cluster configuration that minimizes the distance measure.

    Options: With Fixed start, XLMiner™ builds the model from a single fixed starting point. With Random starts, the algorithm begins from randomly chosen starting points. You specify the No. of starts, and XLMiner™ generates that many cluster sets, determines which one is best, and reports the output for the best cluster set. When Random starts is selected, you also have the option of fixing the seed.

    For this example, we will use the following settings.

  4. Select these options to display the corresponding output.

  5. Click the Finish button to get the results. You will see:

    XLMiner™ calculates the sum of squared distances and determines the Best start. It then generates the remaining output using the Best start as the starting point.
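
The ideas behind the settings in steps 3 to 5 (normalization, several random starts with a fixed seed, and choosing the start with the smallest sum of squared distances) can be sketched in Python with NumPy. This is only an illustrative sketch, not XLMiner™'s implementation: the function names, the default values, the synthetic data, and the use of standard Lloyd-style k-means are all assumptions made for the example.

```python
import numpy as np

def normalize(X):
    """Standardize each column to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std

def kmeans_once(X, k, rng, max_iter=10):
    """One k-means run from a random start (hypothetical helper)."""
    # Random start: pick k distinct records as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every record to every center, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Sum of squared distances from each record to its own center.
    sse = float((dist[np.arange(len(X)), labels] ** 2).sum())
    return centers, labels, sse

def best_of_starts(X, k, n_starts=5, seed=12345):
    """Run several random starts and keep the one with the smallest SSE."""
    rng = np.random.default_rng(seed)  # fixed seed makes runs repeatable
    runs = [kmeans_once(X, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Synthetic data (not Wine.xls): two variables on very different scales.
data_rng = np.random.default_rng(0)
X = np.vstack([
    data_rng.normal([0, 0], [1, 100], size=(20, 2)),
    data_rng.normal([5, 500], [1, 100], size=(20, 2)),
    data_rng.normal([10, 0], [1, 100], size=(20, 2)),
])

centers, labels, sse = best_of_starts(normalize(X), k=3)
print("best sum of squared distances:", round(sse, 3))
```

Without the call to `normalize`, the second variable (whose spread is about 100 times larger) would dominate the Euclidean distance, which is the effect the Normalize input Data option is meant to prevent.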

In the output for "cluster centers" above, the upper box shows the variable values at the cluster centers. The lower box shows the distance between those cluster centers.
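
As a rough illustration of what the lower box contains, the pairwise distances between cluster centers can be computed as below. The center values are made up for the example, and Euclidean distance is assumed.

```python
import numpy as np

# Hypothetical centers for 3 clusters and 2 variables (not from Wine.xls).
centers = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])

# Difference between every pair of centers, shape (3, 3, 2).
diff = centers[:, None, :] - centers[None, :, :]

# Euclidean distance matrix: entry (i, j) is the distance
# between the center of cluster i and the center of cluster j.
center_dist = np.sqrt((diff ** 2).sum(axis=2))
print(center_dist)
```

The matrix is symmetric with zeros on the diagonal; larger off-diagonal entries indicate clusters whose centers are farther apart.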

The data summary shows how many records (observations) there are in each cluster, and the average distance from cluster members to the center of the cluster.
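
The two quantities in the data summary can be reproduced in a few lines. The sketch below uses a tiny invented data set with invented cluster labels and assumes Euclidean distance. It is not XLMiner™ output.

```python
import numpy as np

# Hypothetical records and their cluster labels (0-based here).
X = np.array([
    [0.0, 0.0],
    [0.0, 2.0],
    [10.0, 0.0],
    [10.0, 4.0],
    [10.0, 8.0],
])
labels = np.array([0, 0, 1, 1, 1])
k = 2

# Cluster center = mean of the records assigned to the cluster.
centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Distance from each record to the center of its own cluster.
dist_to_own = np.linalg.norm(X - centers[labels], axis=1)

# Number of records per cluster and average member-to-center distance.
sizes = np.bincount(labels, minlength=k)
avg_dist = np.bincount(labels, weights=dist_to_own, minlength=k) / sizes

for j in range(k):
    print(f"cluster {j + 1}: {sizes[j]} records, average distance {avg_dist[j]:.3f}")
```

A small average distance means the members sit close to their center, so the cluster is compact.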

The final part of the output, above, shows the cluster to which each record belongs and its distance to each of the clusters. Note that, for record 5, the distance to cluster 1 is the minimum distance, so record 5 is assigned to cluster 1.
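
The assignment rule described above (each record goes to the cluster whose center is nearest) can be sketched as follows. The records and centers are invented for the example, and Euclidean distance is assumed.

```python
import numpy as np

# Hypothetical cluster centers and records (not from Wine.xls).
centers = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [10.0, 0.0],
])
records = np.array([
    [1.0, 0.0],
    [4.0, 6.0],
    [9.0, 1.0],
])

# dist[i, j] = distance from record i to the center of cluster j.
dist = np.linalg.norm(records[:, None, :] - centers[None, :, :], axis=2)

# Assign each record to the cluster with the minimum distance.
# Adding 1 gives 1-based cluster numbers, as in the output table.
assigned = dist.argmin(axis=1) + 1

for i, row in enumerate(dist):
    print(f"record {i + 1}: distances {np.round(row, 2)}, cluster {assigned[i]}")
```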
