
Classification Tree

Example:

Data Size: Different versions of XLMiner™ have varying limits on the size of data they can handle. The data set used in the example below may be too large for your version. Refer to Data Handling Specifications for details.

We will use the Boston Housing data set to predict whether the median housing price in housing tracts in the Boston area falls into the "high" or "low" category. This data set has 14 variables, each described in the table below. In addition, one more variable has been created by categorizing median value (MEDV) into two categories.

CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centers
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      Median value of owner-occupied homes in $1000's
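
The two-category variable mentioned above can be derived from MEDV with a simple threshold. The following is a minimal Python sketch of that idea, not the procedure used to build the data set: the text does not state the cutoff, so the value 30 (that is, $30,000), the use of "at or above", and the function name categorize_medv are assumptions made for illustration.

```python
def categorize_medv(medv, cutoff=30.0):
    """Return 1 ("high") if medv is at or above the cutoff, else 0 ("low").

    The cutoff is a hypothetical value chosen for illustration; MEDV is
    measured in thousands of dollars.
    """
    return 1 if medv >= cutoff else 0


# Hypothetical MEDV values for a few tracts.
medv_values = [24.0, 21.6, 34.7, 33.4, 36.2, 18.9]
cat_medv = [categorize_medv(v) for v in medv_values]
print(cat_medv)
```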

  1. Open Boston_Housing.xls from the datasets folder. The figure below shows the data. Note the last column, CAT.MEDV, which we created to classify the median home value of each row (tract). It is a discretized form of MEDV that takes the value 1 for high values of MEDV and 0 for low values.

  2. Click on the XLMiner™ menu and select Partition Data -> Standard Partition. Use the following settings.

    The partition data output is shown in the figure below.

  3. From the XLMiner™ menu, select Classification -> Classification Tree. Select the variables from CRIM to LSTAT and move them to the Input variables box, select CAT.MEDV as the output variable, and click the Next button to proceed. Note that MEDV is not selected for the run, because the output variable CAT.MEDV is derived from MEDV.

  4. In the next dialog box select the Normalize input data option (though this could also be left unchecked; normalizing the data only makes a difference if linear combinations of the input variables are used for splitting).  Select Prune on validation to prune the tree (this will reduce error from over-fitting a complex tree to the idiosyncrasies of the training data).  Select Next.

  5. The third Classification Tree dialog box has options for selecting the graphics. Select Full tree, Minimum error tree and Best pruned tree. Under the score model options, choose the options displayed below. Then click the Finish button.

    Score training data:  Select this option to show an assessment of the performance of the tree in classifying the training data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score validation data:  Select this option to show an assessment of the performance of the tree in classifying the validation data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score Test Data:  The options in this group let you apply the model to score the test partition (if one was created earlier). The option "Score Test Data" is available only if the data set contains a test partition. Select it to apply the model to the test data.

    Score new Data:  The options in this group let you apply the model to score altogether new data. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions on this.

    Score New data in database : See the Example of Discriminant Analysis for detailed instructions on this. 

  6. The output of the classification tree is displayed on a separate sheet, and the various sections of the output can be viewed using the Output Navigator. The figures below show the output sheet and the classification tree for the above example.

    Full Tree  

    How to read the tree:

    Recall that the objective is to classify each case as a 0 (low median value) or a 1 (high median value).  Consider the top node:  the label beneath it indicates the variable represented at this node (i.e. the variable selected for the first split) -- in this case, RM = average number of rooms per dwelling.  The value inside the node indicates the split threshold -- the value that splits the records entering that node, or 6.733 in this case.  Any record where RM <= 6.733 goes to the left; there were 241 of those.  We can think of these as tentatively classified as "0" (low median value).   Any record where RM > the split threshold goes to the right; there were 63 of those.  They are tentatively classified as "1" (high median value).  The 241 going to the left (the ones with fewer rooms on average) then get split up further according to LSTAT (percent of the population that is of lower socioeconomic status).  74 of them fall below the split, 9.535, and are tentatively classified as "1" (they have low percentages of the population with lower socioeconomic status).  The other 167 fall above the split and are classified as a "0."

    A square node indicates a terminal node, after which there are no further splits.  For example, the 167 coming out of LSTAT are classified as 0's and that is the end of the road for them.  The percentage of the records so classified is indicated inside the square.  The path which got them classified as 0's was

    "If few rooms, and if high percent of the population is of lower socioeconomic status, then classify as 0 (low median value)."

    The full tree is not drawn as it will be very large. The structure of the full tree will be clear by reading the Full Tree Rules.
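
The first two splits described above can be written as nested conditions. The sketch below covers only that top portion of the tree, using the thresholds quoted in the text (6.733 for RM and 9.535 for LSTAT). Apart from the terminal node for few rooms and high LSTAT, the returned labels are the tentative ones described above, since the full tree splits those records further. The function name classify_tract is hypothetical, and treating a value exactly equal to the LSTAT threshold as "below" is an assumption that mirrors the RM rule.

```python
RM_SPLIT = 6.733     # split threshold at the top node (average rooms)
LSTAT_SPLIT = 9.535  # split threshold at the next node on the left branch


def classify_tract(rm, lstat):
    """Classify a tract using only the first two splits described in the text."""
    if rm <= RM_SPLIT:
        # Left branch: fewer rooms on average, so look at LSTAT next.
        if lstat <= LSTAT_SPLIT:
            return 1  # tentative: low share of lower-status population
        return 0      # terminal node: few rooms and high LSTAT
    return 1          # right branch, tentative: many rooms on average


print(classify_tract(rm=6.0, lstat=15.0))  # few rooms, high LSTAT
print(classify_tract(rm=6.0, lstat=5.0))   # few rooms, low LSTAT
print(classify_tract(rm=7.5, lstat=15.0))  # many rooms
```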

    Minimum error tree

    The minimum error tree is based on applying to the validation data (202 records here) the variable splitting rules developed with the training data.  The misclassification (error) rate is measured as the tree is pruned back, and the tree that produces the lowest error rate is selected.

     

    Best Pruned Tree  

    Note:  The Best Pruned Tree is based on the validation data set, and is the smallest tree whose misclassification rate is within one standard error of the misclassification rate of the Minimum Error Tree. In this example the Best Pruned Tree and the Minimum Error Tree happen to be the same, because they have the same number of decision nodes (see the Prune Log). However, you will often find that the Best Pruned Tree has fewer decision nodes than the Minimum Error Tree.
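
The two selection rules above, lowest validation error and smallest tree within one standard error of that minimum, can be sketched as follows. This is an illustration under stated assumptions rather than XLMiner's implementation: the prune log values are made up and are not the ones from this example, and the standard error is assumed to follow the binomial formula sqrt(e(1 - e) / n), with n set to the 202 validation records mentioned above. The function names are hypothetical.

```python
import math

# Hypothetical prune log: (number of decision nodes, validation error rate).
prune_log = [
    (0, 0.160),
    (1, 0.090),
    (2, 0.065),
    (3, 0.045),
    (5, 0.035),
    (8, 0.030),
    (12, 0.040),
]
n_validation = 202  # size of the validation partition in this example


def minimum_error_tree(log):
    """Pick the entry with the lowest validation error (smallest tree on ties)."""
    return min(log, key=lambda entry: (entry[1], entry[0]))


def best_pruned_tree(log, n):
    """Pick the smallest tree whose error is within one standard error of the minimum."""
    _, min_err = minimum_error_tree(log)
    std_err = math.sqrt(min_err * (1 - min_err) / n)  # assumed binomial formula
    candidates = [entry for entry in log if entry[1] <= min_err + std_err]
    return min(candidates, key=lambda entry: entry[0])


print(minimum_error_tree(prune_log))
print(best_pruned_tree(prune_log, n_validation))
```

With these made-up numbers the minimum error tree has 8 decision nodes, while the best pruned tree has only 5, because its error rate of 0.035 lies within one standard error (about 0.012) of the minimum of 0.030.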

Training Log

The training log, below, shows the misclassification (error) rate as each additional node is added to the tree.  At the start, with 0 nodes, the tree consists of the full data set and all records would be classified as "low median value" (0).
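
With 0 nodes, every record receives the most common class, so the starting error rate equals the share of records in the other class. A minimal sketch of that calculation, using hypothetical labels rather than the actual training partition (the function name zero_node_error is also hypothetical):

```python
from collections import Counter


def zero_node_error(labels):
    """Return the majority class and the error rate when every record gets it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, 1 - majority_count / len(labels)


# Hypothetical training labels: mostly 0 ("low"), a few 1 ("high").
labels = [0] * 8 + [1] * 2
print(zero_node_error(labels))
```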

Confusion Matrix

A confusion matrix, below, shows counts for cases that were correctly and incorrectly classified in the validation data set.  The 1 in the lower left cell, for example, shows that there was 1 case that was classified as 1 that was actually 0. 
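
The counts in a confusion matrix can be tallied directly from the actual and predicted classes. The sketch below uses hypothetical records, not the validation data from this example. The layout, with actual classes as rows, predicted classes as columns, and class 1 listed first, is an assumption chosen so that the lower left cell holds cases classified as 1 that were actually 0.

```python
def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a two-class problem with labels 0 and 1."""
    counts = {(a, p): 0 for a in (1, 0) for p in (1, 0)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Hypothetical actual and predicted classes for a few validation records.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 0]

cm = confusion_matrix(actual, predicted)
errors = cm[(1, 0)] + cm[(0, 1)]  # off-diagonal cells are misclassifications
print("          pred 1  pred 0")
print(f"actual 1  {cm[(1, 1)]:6d}  {cm[(1, 0)]:6d}")
print(f"actual 0  {cm[(0, 1)]:6d}  {cm[(0, 0)]:6d}")
print("error rate:", errors / len(actual))
```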

CT_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

Lift charts : Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

Method of drawing : After the model is built using the training data set, it is used to score the training data set and the validation data set (if one exists). The data set(s) are then sorted by the predicted output variable value (or by the predicted probability of success in the logistic regression case). After sorting, the actual values of the output variable are accumulated, and the lift curve is drawn as the number of cases versus the cumulative value. The baseline is drawn as the number of cases versus the average of the actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.
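
The lift curve and baseline described above can be computed in a few lines. The sketch below uses hypothetical actual classes and predicted probabilities, and it assumes the records are sorted from highest to lowest predicted value. The function names are hypothetical, and the decile-wise chart is not covered.

```python
def lift_curve(actual, score):
    """Return cumulative actual values after sorting records by descending score."""
    order = sorted(range(len(actual)), key=lambda i: score[i], reverse=True)
    cumulative, total = [], 0
    for i in order:
        total += actual[i]
        cumulative.append(total)
    return cumulative


def baseline(actual):
    """Return the baseline: average actual value times the number of cases."""
    mean = sum(actual) / len(actual)
    return [mean * k for k in range(1, len(actual) + 1)]


# Hypothetical actual classes and predicted probabilities of class 1.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.6, 0.05]

print(lift_curve(actual, score))
print(baseline(actual))
```

Plotting both lists against the number of cases (1 to 8 here) gives the lift curve and the baseline. Both end at the total number of actual 1s, and the area between them reflects how well the model ranks the records.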
