- Open the file Iris.xls in Excel. The objective is to classify the rows (cases) by species, using the other available information.

- In XLMiner™, select Partition Data --> Standard Partition. In the dialog box, set the seed to 47242 (an arbitrary value, used here only as an illustration). The figure below shows the settings selected.

Select all the variables and click "OK". The partition output is shown in the figure below:
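The effect of the seed can be illustrated with a short sketch. This is not XLMiner™'s own partitioning routine: the function name `standard_partition` and the 60/40 training/validation split are assumptions made for the example, and Python's random number generator will not pick the same rows as XLMiner™ does for seed 47242. It only shows that fixing the seed makes a random partition repeatable.

```python
import random

def standard_partition(n_rows, train_fraction=0.6, seed=47242):
    """Randomly split row indices 0..n_rows-1 into training and validation sets.

    train_fraction=0.6 is an assumed setting, not taken from the dialog box.
    """
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# The Iris dataset has 150 rows.
train_rows, valid_rows = standard_partition(150)
print(len(train_rows), len(valid_rows))   # 90 60
```

Calling the function again with the same seed returns the same two sets of rows; changing the seed gives a different partition.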

- In XLMiner™, select Classification --> k-Nearest Neighbors. In the k-NN dialog box, select Petal_width, Petal_length, Sepal_width and Sepal_length as input variables and Species_name as the output variable for this exercise. (Species_No gives the same information as Species_name, so it can be ignored.) Click on Next to proceed.

Variables: This box lists all the variables present in the dataset. If the "First row contains headers" box is checked, the header row above the data is used to identify variable names.
Variables in input data: Select one or more variables as independent variables from the Variables box by clicking on the corresponding selection button. These variables constitute the predictor variables.
Output Variable: Select one variable as the dependent variable from the Variables box by clicking on the corresponding selection button. This is the variable being classified.
Click Next, and the following dialog box appears.
- Here is the second dialog box of k-NN. Type 10 in the Number of nearest neighbors box; this value is somewhat arbitrary but is in line with common practice. The figure below shows the second k-NN dialog box; the available options are explained below.

Normalize Input data: Check this option if the input data are to be normalized. Normalization expresses all data in terms of standard deviations, so that the distance measure is not dominated by variables with a large scale.
Number of Nearest Neighbors: This is the parameter k in the k-nearest neighbor algorithm. The value of k should be between 1 and the total number of observations (rows) in the training set. Note that if k is chosen as the total number of observations in the training set, then for any new observation, all the observations in the training set become nearest neighbors. In this case the predicted class for each new or test case is simply the most common class in the training set (the classification counterpart of predicting the overall average of the response variable).
Scoring option: Select the second option for this example, so that XLMiner™ displays the output for the best k between 1 and 10. If the first option is selected instead, the output is displayed only for the specified value of k.
Score training data: Select this option to show an assessment of the performance in classifying the training data. The report is displayed according to your specifications: Detailed, Summary and Lift charts.
Score validation data: Select this option to show an assessment of the performance in classifying the validation data. The report is displayed according to your specifications: Detailed, Summary and Lift charts.
Score Test Data: The options in this group let you apply the model to score the test partition, if one was created earlier. The option "Score Test Data" is available only if the dataset contains a test partition. Select it to apply the model to the test data.
Score new Data: The options in this group let you apply the model to score entirely new data. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions on this.
Score New data in database: See the Example of Discriminant Analysis for detailed instructions on this.
Click on Finish.
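The Normalize Input data and Number of Nearest Neighbors options described above can be made concrete with a small sketch. It is an illustration only, not XLMiner™'s implementation: the function names are hypothetical, Euclidean distance and the population standard deviation are assumed, ties in the vote go to the class of the nearer neighbor, and the toy rows are made up rather than taken from Iris.xls.

```python
import math
from collections import Counter

def zscore_params(rows):
    """Column means and (population) standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0  # guard constant columns
            for c, m in zip(columns, means)]
    return means, stds

def normalize(row, means, stds):
    """Express each value as a number of standard deviations from the mean."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def knn_classify(train_x, train_y, query, k, normalize_input=True):
    """Predict the class of query by majority vote among its k nearest training rows."""
    if normalize_input:
        means, stds = zscore_params(train_x)
        train_x = [normalize(r, means, stds) for r in train_x]
        query = normalize(query, means, stds)
    # Rank training rows by Euclidean distance to the query, nearest first.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up (petal width, petal length) rows for illustration; not the Iris.xls data.
toy_x = [[0.2, 1.4], [0.3, 1.5], [1.3, 4.0], [1.5, 4.5], [2.1, 5.6], [2.3, 5.9], [1.9, 5.2]]
toy_y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica", "virginica"]

print(knn_classify(toy_x, toy_y, [0.25, 1.45], k=1))           # setosa
print(knn_classify(toy_x, toy_y, [0.25, 1.45], k=len(toy_x)))  # virginica: the majority class
```

The second call shows the limiting case noted above: when k equals the number of training rows, every row votes, so the prediction is the most common training class regardless of the query.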
- The output of k-NN is displayed in a separate sheet, and the various sections of the output can be reached using the Output Navigator bar. We selected the option of scoring using the best k. The validation error log lists the % error on the training and validation data sets for each value of k, and the value of k with the lowest % error on the validation data is selected as the best k. The scoring is then performed using this best value of k.
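The best-k selection can be sketched as follows. This is a minimal illustration under stated assumptions, not XLMiner™'s procedure: the function names are hypothetical, the data are made-up one-feature rows, normalization is omitted for brevity, and ties in validation error are resolved in favor of the smallest k.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; majority vote of the k nearest."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def percent_error(train, data, k):
    """Percentage of rows in data whose predicted class differs from the actual class."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return 100.0 * wrong / len(data)

def validation_error_log(train, valid, max_k):
    """One (k, % error training, % error validation) row for each k from 1 to max_k."""
    return [(k, percent_error(train, train, k), percent_error(train, valid, k))
            for k in range(1, max_k + 1)]

def best_k(log):
    # min() returns the first minimum, so ties go to the smallest k (an assumption).
    return min(log, key=lambda row: row[2])[0]

# Made-up data for illustration only.
train = [([1.0], "a"), ([1.2], "a"), ([1.4], "a"), ([2.6], "b"), ([3.0], "b"), ([3.2], "b")]
valid = [([1.1], "a"), ([2.9], "b"), ([2.1], "b")]

log = validation_error_log(train, valid, max_k=5)
for k, train_err, valid_err in log:
    print(k, train_err, valid_err)
print("best k:", best_k(log))
```

Note that the training error for k = 1 is always 0 when no two training rows coincide, because each row is its own nearest neighbor; this is why the validation error, not the training error, is used to choose k.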

Of particular interest is the Validation Misclassification Summary, which tallies the actual and predicted classifications obtained when the model was applied to the validation data. Correct classification counts lie along the diagonal from upper left to lower right. In this case, there were 3 misclassifications (3 cases in which Virginicas were misclassified as Versicolors).
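A misclassification summary of this kind is a simple cross-tabulation of actual against predicted classes. The sketch below shows how one can be tallied; the function name is hypothetical and the label lists are made up to mimic the pattern described (three Virginicas predicted as Versicolors), not copied from the XLMiner™ output.

```python
from collections import Counter

def misclassification_summary(actual, predicted):
    """Return the class list and a matrix with actual classes as rows, predicted as columns."""
    counts = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix

# Made-up labels for illustration only.
actual    = ["setosa"] * 2 + ["versicolor"] * 2 + ["virginica"] * 4
predicted = ["setosa"] * 2 + ["versicolor"] * 2 + ["versicolor"] * 3 + ["virginica"]

classes, matrix = misclassification_summary(actual, predicted)
for name, row in zip(classes, matrix):
    print(name, row)

# Correct classifications lie on the diagonal; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(classes)))
print("misclassifications:", len(actual) - correct)   # 3
```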

The Classification of Validation Data output, below, shows for each record the predicted class (based on the classes of the nearest neighbors), the percentage of the nearest neighbors belonging to each class, and the actual class. The class with the highest probability is highlighted in yellow. Some intermediate rows in the following figure are collapsed so that row 107 is visible; in that row the mismatch between the predicted and actual class is highlighted in green.
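The per-class probabilities in this output can be read as the share of the k nearest neighbors that belong to each class. A minimal sketch under that assumption follows; the function name and the one-feature toy rows are hypothetical, and XLMiner™ may compute or round its figures differently.

```python
import math
from collections import Counter

def neighbor_class_shares(train, query, k):
    """Fraction of the k nearest training rows that belong to each class."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}

# Made-up data for illustration only.
train = [([1.0], "versicolor"), ([1.1], "versicolor"),
         ([1.3], "virginica"), ([1.4], "virginica"), ([1.5], "virginica")]

shares = neighbor_class_shares(train, [1.35], k=4)
predicted = max(shares, key=shares.get)   # the class with the highest share is predicted
print(shares)
print("predicted class:", predicted)
```

A record is misclassified when the class with the highest share differs from the actual class, which is the situation highlighted in the figure.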

KNNC_Stored_1: XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.