
k-Nearest Neighbors Classification

Example:

Data Size: Different versions of XLMiner™ have different limits on the size of data they can handle. The data set used in the example below may exceed the limit of your version. Refer to Data Handling Specifications for details.

  1. Open the file Iris.xls in Excel. The objective is to classify the rows (cases) by species, using the other available information. 

  2.  

  3. In XLMiner™, select Partition Data --> Standard Partition. In the dialog box, set the seed to 47242 (an arbitrary value, used here only for illustration).

    The figure below shows the settings selected.

    Select all the variables and click "OK". Partition output is shown in the figure below:

  4.  

  5. In XLMiner™, select Classification --> k-Nearest neighbors. In the k-NN dialog box, select Petal_width, Petal_length, Sepal_width and Sepal_length as input variables and Species_name as the output variable for this exercise. (Species_No carries the same information as Species_name, so it can be ignored.) Click Next to proceed.

  6. Variables: This box lists all the variables present in the dataset. If the "First row contains headers" box is checked, the header row above the data is used to identify variable names.

    Variables in input data: Select one or more variables as independent variables from the Variables box by clicking on the corresponding selection button. These variables constitute the predictor variables.

    Output Variable: Select one variable as the dependent variable from the Variables box by clicking on the corresponding selection button. This is the variable being classified.

    Click Next, and the following dialog box appears.

  7. This is the second dialog box of k-NN. Type 10 in the Number of nearest neighbors box; the value is somewhat arbitrary but is a common starting choice. The figure below shows the second dialog box, and the available options are explained below.

    Normalize Input data: Check this option if the input data are to be normalized (this will express all data in terms of standard deviations so that the distance measure is not dominated by variables with a large scale).

    Number of Nearest Neighbors: This is the parameter k in the k-nearest neighbor algorithm. The value of k should be between 1 and the total number of observations (rows). Note that if k is set to the total number of observations in the training set, then for any new observation all the observations in the training set become nearest neighbors. In this case the predicted class for every new or test case is simply the most frequent class in the training set.

    Scoring option: Select the second option for this example, so that XLMiner™ displays the output for the best k between 1 and 10. If the first option is selected, the output is displayed for the specified value of k only.

    Score training data:  Select this option to show an assessment of the performance in classifying the training data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score validation data:  Select this option to show an assessment of the performance in classifying the validation data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score Test Data:  The options in this group let you apply the model to the test partition (if one was created earlier). The option "Score Test Data" is available only if the dataset contains a test partition. Select it to apply the model to the test data.

    Score New Data:  The options in this group let you apply the model to an altogether new data set. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions on this.

    Score New Data in database: See the Example of Discriminant Analysis for detailed instructions on this.

    Click on Finish.

     

  8. The output of k-NN is displayed in a separate sheet, and its sections can be navigated using the Output Navigator bar. We selected the option of scoring using the best k. The validation error log lists the % error on the training and validation data sets for every value of k, and the k with the smallest validation % error is selected as the best k. Scoring is then performed using this best value of k.

    Of particular interest is the Validation Misclassification Summary, which tallies the actual and computed classifications when the model is applied to the validation data.  Correct classification counts lie along the diagonal from upper left to lower right.  In this case, there were 3 misclassifications (3 Virginica cases were misclassified as Versicolor).
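The procedure described in steps 3 to 8 (seeded partition, normalization, an error log over k, selection of the best k, and a misclassification summary) can be sketched outside XLMiner™ in a few lines of Python. This is only an illustrative sketch under stated assumptions, not XLMiner™'s implementation: the data are made-up two-class points rather than Iris.xls, the 60/40 split, the Euclidean distance, and the tie-breaking rules are assumptions, and reusing the seed 47242 here does not reproduce XLMiner™'s partition or results. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def zscore_fit(rows):
    """Per-column mean and standard deviation, estimated from the given rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def zscore_apply(rows, means, stds):
    """Express every value in standard deviations from the column mean."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows closest to query.
    Assumption: Euclidean distance; ties go to the class seen first."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def error_rate(train_x, train_y, xs, ys, k):
    """Percentage of rows in (xs, ys) that the k-NN rule misclassifies."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y
                for x, y in zip(xs, ys))
    return 100.0 * wrong / len(ys)

def best_k(train_x, train_y, valid_x, valid_y, max_k):
    """Build the error log for k = 1..max_k and pick the k with the
    smallest validation error (assumption: smallest k wins a tie)."""
    log = [(k,
            error_rate(train_x, train_y, train_x, train_y, k),
            error_rate(train_x, train_y, valid_x, valid_y, k))
           for k in range(1, max_k + 1)]
    return min(log, key=lambda row: row[2])[0], log

def confusion(actual, predicted):
    """Tally (actual class, predicted class) pairs."""
    return Counter(zip(actual, predicted))

# Made-up data: two classes, second column on a much larger scale.
rng = random.Random(47242)
rows = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 10.0)] for _ in range(30)]
rows += [[rng.gauss(3.0, 1.0), rng.gauss(30.0, 10.0)] for _ in range(30)]
labels = ["A"] * 30 + ["B"] * 30

# Seeded random partition (assumed 60% training, 40% validation).
idx = list(range(len(rows)))
rng.shuffle(idx)
cut = int(0.6 * len(idx))
train_idx, valid_idx = idx[:cut], idx[cut:]

# Normalize with statistics taken from the training partition only.
means, stds = zscore_fit([rows[i] for i in train_idx])
train_x = zscore_apply([rows[i] for i in train_idx], means, stds)
valid_x = zscore_apply([rows[i] for i in valid_idx], means, stds)
train_y = [labels[i] for i in train_idx]
valid_y = [labels[i] for i in valid_idx]

best, log = best_k(train_x, train_y, valid_x, valid_y, max_k=10)
predicted = [knn_predict(train_x, train_y, x, best) for x in valid_x]
matrix = confusion(valid_y, predicted)

for k, train_err, valid_err in log:
    print(f"k={k:2d}  % error training={train_err:5.1f}  "
          f"% error validation={valid_err:5.1f}")
print("best k:", best)
for (actual, pred), n in sorted(matrix.items()):
    print(f"actual={actual}  predicted={pred}  count={n}")
```

In the tally printed at the end, pairs with equal actual and predicted class correspond to the diagonal of the misclassification summary. The training error for k = 1 is always zero in this sketch, because each training row is its own nearest neighbor, which is one reason the best k is chosen on the validation data rather than the training data.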

The Classification of Validation Data output, below, shows for each record the predicted class (based on the classes of its nearest neighbors), the percentage of the nearest neighbors belonging to each class, and the actual class. Note that the class with the highest probability is highlighted in yellow. Some intermediate rows in the following figure are collapsed so that row 107 is visible; there, the mismatch between predicted and actual class is highlighted in green.
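As a rough illustration of how such a per-record table can be derived, the sketch below computes, for each record, the percentage of its k nearest neighbors in each class, takes the class with the largest share as the prediction, and flags mismatches with the actual class. The training points, the records, the choice k = 3, and the function names are all made up for this example; Euclidean distance is an assumption, and nothing here is taken from XLMiner™'s output.

```python
import math
from collections import Counter

def neighbor_shares(train_x, train_y, query, k):
    """Percentage of the k nearest training rows that fall in each class."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    counts = Counter(train_y[i] for i in order[:k])
    return {cls: 100.0 * n / k for cls, n in counts.items()}

def classify(train_x, train_y, query, k):
    """Predicted class is the one with the largest share of neighbors."""
    shares = neighbor_shares(train_x, train_y, query, k)
    predicted = max(shares, key=shares.get)
    return predicted, shares

# Hypothetical training rows and labels.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [3.0, 3.0], [3.2, 2.9], [2.0, 2.0]]
train_y = ["A", "A", "A", "B", "B", "B"]

# Hypothetical records to score: (inputs, actual class).
records = [([1.1, 1.0], "A"), ([3.1, 3.0], "B"), ([1.6, 1.6], "B")]

results = []
for inputs, actual in records:
    predicted, shares = classify(train_x, train_y, inputs, k=3)
    results.append((predicted, shares, actual))
    flag = "" if predicted == actual else "  <-- mismatch"
    pretty = ", ".join(f"{c}: {p:.1f}%" for c, p in sorted(shares.items()))
    print(f"predicted={predicted}  actual={actual}  [{pretty}]{flag}")
```

The third record is classified as A because two of its three nearest neighbors are A, although its actual class is B; this is the kind of row that the output sheet highlights as a mismatch.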

KNNC_Stored_1: XLMiner™ generates this sheet along with the other outputs. Refer to Stored Model Sheets for details.

 
