
 

Naïve Bayes Classification

 

Example:

Data Size: Different versions of XLMiner™ have different limits on data size. Your version may not support the size of the data used in the example below. Refer to Data Handling Specifications for details.

The figure below shows the data: the results of flying fitness tests for 40 pilots. There are five categorical variables (named Var2 through Var6) that indicate the pilots' performance on various physical and psychological tests.

  1. Open the Flying_Fitness.xls dataset from the datasets folder.

  2. Open the XLMiner™ menu and select Partition data --> Standard Partition. Use the following settings.

     

  3. Open the XLMiner™ menu, select Classification, and click Naïve Bayes to open the Naïve Bayes dialog box. The first dialog box lists all variables available for selection. Select Var2 through Var6 as input variables and TestRes/Var1 as the output variable, as shown in the figure below. Click Next to proceed.

  4. In the second dialog box of Naïve Bayes, select Calculate according to relative occurrences and click Next to proceed.

  5. Select the score options for the output you need; in this case, select all options for training and validation data.

    Score training data: Select this option to show an assessment of the model's performance in classifying the training data. The report contains the parts you select: Detailed report, Summary report, and Lift charts.

    Score validation data: Select this option to show an assessment of the model's performance in classifying the validation data. The report contains the parts you select: Detailed report, Summary report, and Lift charts.

    Score test data: The options in this group apply the model to the test partition, if one was created earlier. The "Score Test Data" option is available only if the dataset contains a test partition. Select it to apply the model to the test data.

    Score new data: The options in this group apply the model to entirely new data. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions.

    Score new data in database: See the Example of Discriminant Analysis for detailed instructions.

  6. The output of Naïve Bayes is displayed on a separate sheet. Use the Output Navigator to view its sections.

    See the output for Classification of Validation Data below. To predict the class of the output variable, XLMiner™ calculates, for each record, the conditional probability of each class given that record's input values. In this case the classes are 0 and 1. For every record in the validation data, the conditional probabilities for class 0 and for class 1 are calculated as shown below, and the larger of the two is highlighted. XLMiner™ assigns the record to the class with the highest conditional probability.

    In addition to the classified data (above), you can also view the prior class probabilities. In this case, the training data had 54.17% "1"s and 45.83% "0"s.

The conditional probabilities are also shown. In this case, of the training cases in class "1", 15.38% had a value of "0" for variable 2 and the remaining 84.62% had a value of "1". No case in class "1" had a value of "2" for variable 2.
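The calculation described above can be sketched in a few lines of Python. This is a minimal illustration of estimating probabilities by relative occurrence and picking the most probable class, not XLMiner™'s actual implementation. The function names are hypothetical, and the toy data uses a single input variable. The class "1" counts (13 of 24 records, 2 of them with value 0) are inferred from the percentages quoted above; the class "0" values are invented for the example.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Estimate priors and conditional probabilities by relative occurrence."""
    class_counts = Counter(labels)
    priors = {c: k / len(labels) for c, k in class_counts.items()}
    # value_counts[c][j][v] = number of class-c records whose j-th input equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
    conditionals = {
        c: {j: {v: k / class_counts[c] for v, k in counts.items()}
            for j, counts in per_var.items()}
        for c, per_var in value_counts.items()
    }
    return priors, conditionals

def posterior(row, priors, conditionals):
    """Return the probability of each class given one record's input values."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(row):
            # A value never seen with class c in training gets probability 0.
            score *= conditionals[c][j].get(v, 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

# Toy training set (assumption): 24 records, one input variable.
labels = [1] * 13 + [0] * 11
rows = ([(0,)] * 2 + [(1,)] * 11                 # class 1: inferred counts
        + [(0,)] * 6 + [(1,)] * 3 + [(2,)] * 2)  # class 0: invented values

priors, conditionals = fit_naive_bayes(rows, labels)
print(round(priors[1] * 100, 2), round(priors[0] * 100, 2))  # 54.17 45.83
print(round(conditionals[1][0][0] * 100, 2))                 # 15.38
print(round(conditionals[1][0][1] * 100, 2))                 # 84.62

post = posterior((1,), priors, conditionals)
predicted = max(post, key=post.get)  # class with the highest probability
print(predicted)                     # 1
```

Because no class "1" record has the value 2, a record with that value gets probability 0 for class "1" in this sketch and is assigned to class "0".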

NNB_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

Lift charts : Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

Method of drawing : After the model is built using the training data set, it is used to score the training data set and the validation data set (if one exists). The data set(s) are then sorted in descending order of the predicted output variable value (or of the predicted probability of success, as in logistic regression). After sorting, the actual values of the output variable are cumulated, and the lift curve is drawn as the number of cases versus the cumulated value. The baseline is drawn as the number of cases versus the average of the actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.
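The cumulative lift curve and baseline described above can be computed as in the following sketch. It assumes a 0/1 output variable and sorts by predicted probability from highest to lowest; the function name and the eight sample records are made up for illustration and do not come from the Flying_Fitness data.

```python
def lift_curve(actual, predicted_prob):
    """Return (cumulative, baseline) values for a cumulative lift chart."""
    # Sort record indices by predicted probability, highest first.
    order = sorted(range(len(actual)),
                   key=lambda i: predicted_prob[i], reverse=True)
    cumulative, running = [], 0
    for i in order:
        running += actual[i]          # cumulate the actual outcome values
        cumulative.append(running)
    mean = sum(actual) / len(actual)  # average actual output value
    # Baseline: average actual value multiplied by the number of cases so far.
    baseline = [mean * (k + 1) for k in range(len(actual))]
    return cumulative, baseline

# Made-up sample: actual classes and predicted probabilities of class 1.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_prob = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3]

cumulative, baseline = lift_curve(actual, predicted_prob)
print(cumulative)  # [1, 2, 3, 4, 4, 4, 4, 4]
print(baseline)    # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

Plotting the case number (1 to 8) against each list gives the lift curve and the baseline. The two always meet at the last case, and a curve that rises faster than the baseline indicates that the model ranks the actual "1" cases near the top.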
