k-Nearest Neighbors Prediction

 

Example:

Data Size: Different versions of XLMiner™ have varying limits on the size of data. The size of the data used in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

Let's use the Boston housing dataset to predict the median value of a housing unit in housing tracts in the Boston area. The dataset has 14 variables, each described in the table below. The dependent variable, MEDV, is the median value of a dwelling. In this exercise we will use k-Nearest Neighbors to predict the value of MEDV.

Variable   Description
CRIM       Per capita crime rate by town
ZN         Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS      Proportion of non-retail business acres per town
CHAS       Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX        Nitric oxides concentration (parts per 10 million)
RM         Average number of rooms per dwelling
AGE        Proportion of owner-occupied units built prior to 1940
DIS        Weighted distances to five Boston employment centers
RAD        Index of accessibility to radial highways
TAX        Full-value property-tax rate per $10,000
PTRATIO    Pupil-teacher ratio by town
B          1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT      % lower status of the population
MEDV       Median value of owner-occupied homes in $1000's

  1. Open Boston_Housing.xls from the datasets folder. The figure below shows the dataset. We will not use the variable in the last column (CAT.MEDV), which is merely a discrete classification of the MEDV variable into high and low median value.

  2. Partition the data using XLMiner --> Partition Data --> Standard Partition. Use the settings shown below.

  3. Select Prediction --> k-Nearest Neighbors. Select the variables from CRIM to LSTAT as input variables and MEDV as the output variable, then click Next to proceed.

  4. In the second dialog box of k-NN, select Normalize input data and type 5 for Number of nearest neighbors.

    Normalize input data: Select this option to normalize the input data. Normalization expresses every variable in units of standard deviations, so that the distance measure is not dominated by variables with a large scale.

     

    Number of nearest neighbors: This is the parameter k in the k-nearest neighbors algorithm. The value of k must be between 1 and the total number of observations (rows). Typically k is a one- or two-digit number.

     

    Scoring option: Select the second option for this example, so that XLMiner™ displays the output for the best k between 1 and 5. If you select the first option, the output is displayed for the specified value of k.

     

    Score training data: Select this option to show an assessment of the performance in predicting the training data. The report is displayed in the forms you select: Detailed, Summary, and Lift charts.

    Score validation data: Select this option to show an assessment of the performance in predicting the validation data. The report is displayed in the forms you select: Detailed, Summary, and Lift charts.

    Score test data: The options in this group let you apply the model to the test partition, if one was created earlier. The option "Score Test Data" is available only if the dataset contains a test partition. Select it to apply the model to the test data.

    Score new data: The options in this group let you apply the model to altogether new data. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions.

    Score new data in database: See the Example of Discriminant Analysis for detailed instructions.

    Click Finish.

  5. The output of k-NN prediction is displayed in a separate sheet. As specified, XLMiner™ calculates the RMS error for every value of k from 1 to 5 and selects as the best k the value with the minimum RMS error.
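The procedure in steps 4 and 5 can be sketched in a few lines of Python. This is an illustration only, not XLMiner's implementation. It assumes z-score normalization with statistics computed on the training rows, Euclidean distance, an unweighted average of the neighbors' output values, and selection of k by RMS error on the validation rows. The function names and the toy data are hypothetical.

```python
import math
import statistics


def fit_scaler(rows):
    """Return a (mean, std) pair per input column, computed on the training rows."""
    columns = list(zip(*rows))
    # Assumption: population standard deviation; a constant column falls back to 1.0.
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in columns]


def normalize(rows, scaler):
    """Express every value in standard deviations from its column mean."""
    return [[(x - mean) / std for x, (mean, std) in zip(row, scaler)] for row in rows]


def knn_predict(train_x, train_y, query, k):
    """Average the output values of the k training rows nearest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k


def rms_error(train_x, train_y, valid_x, valid_y, k):
    """Root mean squared error of the k-NN predictions on the validation rows."""
    squared = [(y - knn_predict(train_x, train_y, q, k)) ** 2
               for q, y in zip(valid_x, valid_y)]
    return math.sqrt(sum(squared) / len(squared))


def best_k(train_x, train_y, valid_x, valid_y, max_k):
    """Try k = 1..max_k and return the k with the smallest RMS error, plus all errors."""
    errors = {k: rms_error(train_x, train_y, valid_x, valid_y, k)
              for k in range(1, max_k + 1)}
    return min(errors, key=errors.get), errors


# Toy data (hypothetical): two inputs on very different scales, one output.
train_x_raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [10.0, 1000.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
valid_x_raw = [[2.4, 240.0], [1.6, 160.0]]
valid_y = [24.0, 16.0]

scaler = fit_scaler(train_x_raw)
train_x = normalize(train_x_raw, scaler)
valid_x = normalize(valid_x_raw, scaler)  # reuse the training statistics

k_best, errors = best_k(train_x, train_y, valid_x, valid_y, max_k=4)
print("best k:", k_best)
print("RMS error by k:", errors)
```

The validation rows are normalized with the training means and standard deviations so that both partitions are measured on the same scale.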

The Summary report, below, summarizes the prediction error. The first number, the total sum of squared errors, is the sum of the squared deviations (residuals) between the predicted and actual values. The second is the RMS error, the square root of the average of the squared residuals. The third is the average deviation, the mean of the residuals. All these values are calculated for the best k, i.e., k = 2.
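As a rough illustration of the three summary figures, the sketch below computes them from lists of actual and predicted values. It assumes the residual is taken as actual minus predicted and that the average deviation is the signed mean of the residuals; the function name, dictionary keys, and sample numbers are hypothetical.

```python
import math


def summary_report(actual, predicted):
    """Return the three summary error measures for a set of predictions."""
    # Assumption: residual = actual - predicted.
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    return {
        "total_sum_of_squared_errors": sse,
        "rms_error": math.sqrt(sse / len(residuals)),
        # Assumption: signed mean, so positive and negative residuals can cancel.
        "average_error": sum(residuals) / len(residuals),
    }


# Hypothetical values, not taken from the Boston housing output.
report = summary_report(actual=[24.0, 16.0, 30.0], predicted=[25.0, 15.0, 28.0])
print(report)
```

An average error near zero alongside a large RMS error would suggest that the predictions are unbiased on average but individually imprecise.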

The Prediction of Validation Data screen, below, shows the predicted value, the actual value and the difference between them (the residual) for each record.

KNNP_Stored_1: XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

Lift charts: Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

Method of drawing: After the model is built using the training data set, it is used to score the training data set and the validation data set (if one exists). The data sets are then sorted by the predicted value of the output variable (or by the predicted probability of success in the logistic regression case). After sorting, the actual values of the output variable are cumulated, and the lift curve is drawn as the number of cases versus the cumulated value. The baseline is drawn as the number of cases versus the average of the actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.
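The lift curve and baseline described above can be sketched as follows. This is an illustration under assumptions, not XLMiner's code: it assumes the records are sorted from the highest predicted value to the lowest, and it omits the decile-wise chart. The function name and sample values are hypothetical.

```python
def lift_curve(actual, predicted):
    """Return (cumulative, baseline) y-values for case counts 1..n."""
    # Assumption: sort records by predicted value, highest first.
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)

    # Lift curve: running total of the actual values in sorted order.
    cumulative, running = [], 0.0
    for i in order:
        running += actual[i]
        cumulative.append(running)

    # Baseline: average actual value multiplied by the number of cases.
    mean_actual = sum(actual) / len(actual)
    baseline = [mean_actual * n for n in range(1, len(actual) + 1)]
    return cumulative, baseline


# Hypothetical values, not taken from the Boston housing output.
cumulative, baseline = lift_curve(actual=[10.0, 40.0, 20.0, 30.0],
                                  predicted=[12.0, 35.0, 25.0, 28.0])
print(cumulative)
print(baseline)
```

Both curves end at the same point, the total of the actual values, so the model's benefit shows in how far the lift curve rises above the baseline before that point.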

The Lift charts for the training and validation data are shown below.
