
k-Nearest Neighbors (k-NN) Prediction

Introduction

In k-nearest-neighbor prediction, a training data set is used to predict the value of a variable of interest for each member of a "target" data set. The data consist of a variable of interest ("amount purchased," for example) and a number of additional predictor variables (age, income, location, ...). Generally speaking, the algorithm is as follows:

  1. For each row (case) in the target data set (the set to be predicted), locate the k closest members (the k nearest neighbors) of the training data set. A Euclidean distance measure is used to calculate how close each member of the training set is to the target row that is being examined.

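  2. Find the weighted sum of the variable of interest for the k nearest neighbors. The weights are the inverse of the distances, normalized to sum to 1 so that the prediction is a weighted average on the same scale as the variable of interest.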

  3. Repeat this procedure for the remaining rows (cases) in the target set.

  4. In addition, XLMiner™ lets the user select a maximum value for k. It builds models in parallel for all values of k up to the specified maximum, and scoring is done with the best of these models.
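
Steps 1 to 3 above can be sketched in a few lines of Python. This is an illustrative sketch only, not XLMiner's implementation: the function names (euclidean, knn_predict, knn_predict_all) are hypothetical, the weights are assumed to be normalized to sum to 1, and the handling of a neighbor at distance zero (which would otherwise receive infinite weight) is an assumption made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, target_row, k):
    """Predict the variable of interest for a single target row."""
    # Step 1: measure the distance to every training row and keep the k closest.
    neighbors = sorted(
        (euclidean(row, target_row), y) for row, y in zip(train_X, train_y)
    )[:k]

    # Assumption: if any neighbor coincides with the target row (distance 0),
    # return the mean of those exact matches instead of dividing by zero.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    # Step 2: inverse-distance weights, normalized so they sum to 1.
    weights = [1.0 / d for d, _ in neighbors]
    weighted_total = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_total / sum(weights)

def knn_predict_all(train_X, train_y, target_X, k):
    """Step 3: repeat the procedure for every row of the target set."""
    return [knn_predict(train_X, train_y, row, k) for row in target_X]
```

For example, with training predictors [[0.0], [2.0], [10.0]] and values [10.0, 20.0, 100.0], the call knn_predict(train_X, train_y, [1.0], k=2) finds two neighbors at equal distance 1 and returns their average, 15.0.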

Computing time increases with k, but higher values of k provide smoothing that reduces vulnerability to noise in the training data. In practical applications, k is typically in the single digits or tens rather than in the hundreds or thousands.
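
One way to pick the best k up to a user-specified maximum is to score every candidate k on held-out validation rows and keep the one with the lowest error. The Python sketch below illustrates that idea under stated assumptions: the criterion (root-mean-square error on a validation set), the function names (sorted_neighbors, weighted_prediction, best_k), and the treatment of zero distances are all hypothetical choices for the example, since the criterion XLMiner uses is not described here.

```python
import math

def sorted_neighbors(train_X, train_y, row):
    """All training rows as (distance, value) pairs, nearest first."""
    return sorted((math.dist(r, row), y) for r, y in zip(train_X, train_y))

def weighted_prediction(neighbors):
    """Inverse-distance weighted average of the given (distance, value) pairs."""
    # Assumption: exact matches (distance 0) are averaged to avoid division by zero.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

def best_k(train_X, train_y, valid_X, valid_y, max_k):
    """Return (best k, {k: validation RMSE}) for k = 1 .. max_k."""
    max_k = min(max_k, len(train_X))  # k cannot exceed the training set size
    squared_error = [0.0] * (max_k + 1)
    for row, actual in zip(valid_X, valid_y):
        # Sort once per validation row; each k reuses a prefix of the same list.
        neighbors = sorted_neighbors(train_X, train_y, row)
        for k in range(1, max_k + 1):
            predicted = weighted_prediction(neighbors[:k])
            squared_error[k] += (predicted - actual) ** 2
    rmse = {k: math.sqrt(squared_error[k] / len(valid_X))
            for k in range(1, max_k + 1)}
    # Ties are resolved in favor of the smallest k.
    return min(rmse, key=rmse.get), rmse
```

Because the neighbors are sorted once per validation row, evaluating every k up to the maximum costs little more than evaluating the largest k alone. When the training values are noisy, the error of this sketch tends to fall as k grows from 1, which is the smoothing effect described above.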
