| CRIM | Per capita crime rate by town |
| ZN | Proportion of residential land zoned for lots over 25,000 sq.ft. |
| INDUS | Proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| NOX | Nitric oxides concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built prior to 1940 |
| DIS | Weighted distances to five Boston employment centers |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property-tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| LSTAT | % Lower status of the population |
| MEDV | Median value of owner-occupied homes in $1000's |
Open Boston_Housing.xls from the datasets folder.
The figure below shows the data. Note the last column, CAT.MEDV, which we
created to classify the median house value for each row (tract). It is a
discretized form of the variable MEDV: 1 for high values of MEDV and 0 for low values.
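The discretization of MEDV into CAT.MEDV can be sketched in a few lines of Python. The cutoff is not stated on this page, so the `threshold` of 30 (in $1000's) and the function name `discretize_medv` are assumptions made for illustration only.

```python
def discretize_medv(medv_values, threshold=30.0):
    """Map each MEDV value to 1 (high) or 0 (low).

    The threshold of 30 is an assumed cutoff; this page does not
    state the value used to build CAT.MEDV.
    """
    return [1 if value >= threshold else 0 for value in medv_values]


# A few made-up MEDV values, in $1000's.
cat_medv = discretize_medv([24.0, 34.7, 21.6, 50.0])
print(cat_medv)
```

Changing `threshold` changes how many tracts fall in class 1, which in turn changes the tree that is grown.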
Click on the XLMiner™ menu and select Partition Data -> Standard Partition. Use the following settings.

Partition data output is shown in the figure below
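For readers who want to see what a random partition does, here is a minimal Python sketch. The record counts quoted later on this page (304 training records and 202 validation records) suggest a 60/40 split of 506 rows; the 0.6 fraction, the seed value and the function name `standard_partition` are assumptions, not the settings from the dialog box.

```python
import random


def standard_partition(n_records, train_fraction=0.6, seed=12345):
    """Randomly assign row indices to training and validation sets.

    train_fraction and seed are assumed values for illustration.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = round(n_records * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, valid_idx = standard_partition(506)
print(len(train_idx), len(valid_idx))
```

Fixing the seed makes the partition repeatable, so the same rows land in the training set on every run.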

From the XLMiner™ menu, select Classification -> Classification Tree. Select the
variables from CRIM to LSTAT and move them to the Input variables box, select
CAT.MEDV as the output variable, and click the Next button to proceed. MEDV is
not selected for the run because the output variable, CAT.MEDV, is derived from
MEDV.

In the next dialog box select the Normalize input data option
(though this could also be left unchecked; normalizing the data only makes a
difference if linear combinations of the input variables are used for
splitting). Select Prune on validation to prune the tree (this will
reduce error from over-fitting a complex tree to the idiosyncrasies of the
training data). Select Next.

The third dialog box of Classification Tree offers options for the graphics.
Select Full tree, Minimum error tree and Best pruned tree. Under the score
model options, choose the options as displayed below. Then click the Finish
button.

Score training data:
Select this option to show an assessment of the performance of the tree
in classifying the training data. The report is displayed according to
your specifications: Detailed, Summary and Lift charts.
Score validation data:
Select this option to show an assessment of the performance of the tree
in classifying the validation data. The report is displayed according to
your specifications: Detailed, Summary and Lift charts.
Score test data:
The options in this group let you apply the model to the test partition.
They are available only if a test partition was created earlier.
Score new data:
The options in this group let you apply the model to entirely new data.
Specify where the new data is located. See the Example of Discriminant
Analysis for detailed instructions.
Score new data in database:
See the Example of Discriminant Analysis for detailed instructions.
The output of the classification tree is displayed on a separate sheet, and
the various sections of the output can be viewed by using the Output Navigator.
The figures below show the output sheet and the classification tree for the
above example.

Full Tree

How to read the tree:
Recall that the objective is to classify each case as a 0 (low median value)
or a 1 (high median value). Consider the top node: the label beneath
it indicates the variable selected for the first split, in this case RM
(average number of rooms per dwelling). The value inside the node is the split
threshold, the value that divides the records entering that node; here it is
6.733. Any record where RM <= 6.733 goes to the left; there were 241 of those.
We can think of these as tentatively classified as "0" (low median value). Any
record where RM > 6.733 goes to the right; there were 63 of those. They are
tentatively classified as "1" (high median value). The 241 going to the left
(the ones with fewer rooms on average) are then split further according to
LSTAT (percent of the population that is of lower socioeconomic status). Of
these, 74 fall below the split value, 9.535, and are tentatively classified as
"1" (they have low percentages of the population with lower socioeconomic
status). The other 167 fall above the split and are classified as "0."
A square node indicates a terminal node, after which there are no further
splits. For example, the 167 records coming out of the LSTAT node are
classified as 0's, and that is the end of the road for them. The percentage of
the records so classified is indicated inside the square. The path that got
them classified as 0's was
"If few rooms, and if high percent of the population is of lower
socioeconomic status, then classify as 0 (low median value)."
The full tree is not drawn because it would be very large. Its structure can
be read from the Full Tree Rules.
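The two splits described above can be written as nested conditions. The sketch below covers only the top of the tree as described in the text; deeper splits are not reproduced here, so non-terminal branches return a tentative class. The function name `top_of_tree`, and the use of `<=` at the LSTAT split, are assumptions.

```python
def top_of_tree(rm, lstat, rm_split=6.733, lstat_split=9.535):
    """Follow the first two splits of the example tree.

    Returns (tentative_class, is_terminal). Only the branch with
    RM <= 6.733 and LSTAT > 9.535 is described as terminal in the text.
    """
    if rm > rm_split:
        # Right branch of the root: tentatively high median value.
        return 1, False
    if lstat <= lstat_split:
        # Few rooms but low percentage of lower-status population.
        return 1, False
    # Few rooms and high percentage of lower-status population.
    return 0, True


print(top_of_tree(rm=6.0, lstat=15.0))
```

A record with RM = 6.0 and LSTAT = 15.0 follows the path "few rooms, high percent lower status" and ends at the terminal node for class 0.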

Minimum Error Tree
The minimum error tree is based on applying to the validation data (202
records here) the variable splitting rules developed with the training
data. The misclassification (error) rate is measured as the tree is pruned
back, and the tree that produces the lowest error rate is selected.

Best Pruned Tree
Note: The Best Pruned Tree is based on the validation data set, and is the
smallest tree whose misclassification rate is within one standard error of the
misclassification rate of the Minimum Error Tree. In this example the Best
Pruned Tree and the Minimum Error Tree happen to be the same: both have the
same number of decision nodes (please refer to the Prune Log). However, you
will often find that the Best Pruned Tree has fewer decision nodes than the
Minimum Error Tree.
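The selection of the Minimum Error Tree and the Best Pruned Tree from a prune log can be sketched as follows. The prune log values here are invented for illustration and are not the values from this example, and the binomial formula for the standard error is an assumption about how "one standard error" is computed. The function name `select_pruned_trees` is hypothetical.

```python
import math


def select_pruned_trees(prune_log, n_validation):
    """Pick tree sizes from (decision_nodes, validation_error_rate) pairs.

    Returns (min_error_nodes, best_pruned_nodes, standard_error).
    """
    min_error = min(error for _, error in prune_log)
    # Minimum error tree: lowest validation error (smallest size on ties).
    min_error_nodes = min(n for n, error in prune_log if error == min_error)
    # Assumed binomial standard error of the minimum error rate.
    std_err = math.sqrt(min_error * (1 - min_error) / n_validation)
    # Best pruned tree: smallest tree within one standard error.
    best_nodes = min(n for n, error in prune_log if error <= min_error + std_err)
    return min_error_nodes, best_nodes, std_err


# Hypothetical prune log, NOT taken from the example output.
log = [(0, 0.20), (1, 0.10), (2, 0.06), (3, 0.05), (4, 0.05), (5, 0.07)]
min_nodes, best_nodes, std_err = select_pruned_trees(log, n_validation=202)
print(min_nodes, best_nodes, round(std_err, 4))
```

With these made-up numbers the two trees differ; when no smaller tree falls within one standard error, the two selections coincide, as they do in the example above.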

Training Log
The training log, below, shows the misclassification (error) rate as each
additional node is added to the tree. At 0 nodes, before any split has been
made on the training data, all records would be classified as "low median
value" (0).

Confusion Matrix
A confusion matrix, below, shows counts for cases that were correctly and
incorrectly classified in the validation data set. The 1 in the lower left
cell, for example, shows that there was 1 case that was
classified as 1 that was actually 0.
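A confusion matrix of this kind can be tallied from two lists of labels. The sketch below assumes rows are actual classes and columns are predicted classes, each ordered 1 then 0, which is consistent with the lower left cell described above. The labels are made up and the function name `confusion_matrix` is hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes=(1, 0)):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Made-up labels for six cases.
actual = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
errors = matrix[0][1] + matrix[1][0]  # off-diagonal cells
print(matrix, errors / len(actual))
```

Here `matrix[1][0]` is the lower left cell: cases that were actually 0 but classified as 1.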

CT_Stored_1: XLMiner™ generates this sheet along with the other outputs.
Please refer to the Stored Model Sheets for details.
Lift charts:
Lift charts are visual aids for measuring model performance. They consist of
a lift curve and a baseline. The greater the area between the lift curve
and the baseline, the better the model.
Method of drawing:
After the model is built using the training data set, it is used to score the
training data set and the validation data set (if one exists). The data set(s)
are then sorted by the predicted output variable value (or by the predicted
probability of success in the logistic regression case). After sorting, the
actual values of the output variable are cumulated, and the lift curve is drawn
as the number of cases versus the cumulated value. The baseline is drawn as the
number of cases versus the average of the actual output variable values
multiplied by the number of cases. The decile-wise lift curve is drawn as the
decile number versus the cumulative actual output variable value divided by the
decile's average output variable value.
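The lift curve and baseline described above can be computed as follows. This is a minimal sketch: it assumes the cases are sorted in descending order of the predicted value, which the text does not state explicitly, and the data and the function name `lift_curve` are made up for illustration.

```python
def lift_curve(actual, scores):
    """Return (cumulative, baseline) lists, one entry per case.

    Cases are sorted by score, highest first (an assumed sort order).
    """
    order = sorted(range(len(actual)), key=lambda i: scores[i], reverse=True)
    cumulative, running_total = [], 0
    for i in order:
        running_total += actual[i]
        cumulative.append(running_total)
    average = sum(actual) / len(actual)
    # Baseline: average actual value multiplied by the number of cases.
    baseline = [average * (k + 1) for k in range(len(actual))]
    return cumulative, baseline


# Made-up actual classes and predicted probabilities for five cases.
cumulative, baseline = lift_curve([1, 0, 1, 0, 0], [0.9, 0.2, 0.8, 0.4, 0.1])
print(cumulative)
print(baseline)
```

Plotting `cumulative` and `baseline` against the case count 1, 2, 3, ... gives the lift chart; both curves always meet at the last case.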