Contents

Regression Tree

Example

Data Size: Different versions of XLMiner™  have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

We will use the Boston Housing data set to predict the median house price in housing tracts in the Boston area.  

  1. Open Boston_Housing.xls from the datasets folder. The figure below shows the data.

  2. Click on the XLMiner™menu and select Partition Data --> Standard Partition. Use the following settings. 

  3. Data Range :  Specify the range of input data used for partitioning here. XLMinerTM automatically picks the active data range, or you can select by either typing or selecting the desired range. 

    Variables:  This box lists all the variables present in the data set.  If "First row contains headers" is checked, the header row above the data is used to find variable names.

    Partitioning options:  Leave the "Pick up rows randomly" box checked (if you have a variable in the data set designating which partition a row is assigned to, you could check the "Use partition variable" option). Change the seed to 111. 

    Partitioning percentages...:  Select Specify percentages. Set the Training Set to 50%, Validation Set to 30% and the Test Set at 20%. 

    Click on "OK" and the Data Partition sheet is displayed to show the results of partitioning.  Note the links at the top to the two partitioned data sets:

  4. Having partitioned the data, you are now ready to do the regression. In XLMiner™ click on Prediction --> Regression Tree.  The Regression Tree dialog box appears:

    T

    Data Partition1 is selected by default; keep this selection. In the Input data section scroll down to the bottom and select MEDV as the output variable. Select all other variables and move them to the Input variables box. Click on Next to proceed to the next dialog box. 

  5. The next dialog box provides the following options. Set the options as shown below.

    Normalize input data: Leave Normalize input data option unchecked. (Normalizing the data only makes a difference if linear combinations of the input variables are used for splitting)

    Maximum #of splits for input variable: Enter the number of splits you can allow for input variables.

    Minimum #records in a terminal node:  The tree building will proceed until all the terminal nodes reach this size.

    Scoring options : You can select the option to score on. Let us choose to score using best pruned tree. 

     Select Next.

  6. In the third dialog box of Regression Tree there are options for selecting the graphics to be displayed. Select Full tree and Pruned tree. The figure below shows the dialog box with various selections made.

    Maximum #levels to be displayed : Enter here how many levels you would like the tree to display. 

    Full tree:  Check this box to display the full regression tree using training data.

    Pruned tree:  Check this box to display the pruned tree, pruned using validation data.

    Minimum error tree : Check this box to display minimum error tree, pruned using validation data.

    Score on training /validation data:  Check the boxes to show the scores that result from training and validation data.

    Score Test/New Data:  Select the appropriate options for applying the model to test data and /or new data. See the Example of Discriminant Analysis for detailed instructions for new data. 

    Score New data in database : See the Example of Discriminant Analysis for detailed instructions on this. 

    Click on the Finish button. 

  7. The output is displayed on a separate sheet and various sections of the output can be viewed by using the Output Navigator.  Here are the predicted values in the training data. 

  8. The Prune log shows the residual sum of squares (RSS) at each stage of the tree for both training and validation data sets. This is the sum of squared residuals (difference between predicted and actual). You can see in the prune log, that the validation RSS goes on reducing as the tree goes on splitting. The cost complexity is calculated at each step. Cost Complexity Factor is the parameter that governs how far back the tree should be pruned. XLMiner™ decides the #decision nodes for the pruned tree and minimum error tree from the values of RSS and cost complexity. In the Prune log shown below, Validation RSS goes on reducing till the #Decision Nodes increases form 0 to 5, then starts increasing. So the Minimum Error and Best Pruned Tree show 5 decision nodes each, as shown below.

    The real test is how well this tree does (i.e. how well these rules do) with the validation data, which were not used in training the model. Here are the predicted validation data values:

    We will look at the pruned tree below to understand how to read it.  "Pruning" involves chopping off nodes from the bottom up so that there are fewer and fewer branches on the tree. This option causes the regression tree to be pruned thereby reducing error from over-fitting a complex tree to the idiosyncrasies of the validation data. 

 

    We can read this tree as follows. LSTAT (% of the population that is lower status) is chosen as the first splitting variable; if this percentage is > 7.82% (106 cases), then LSTAT is again chosen for splitting. Now, if LSTAT <=14.895 (51 cases) then MEDV is predicted as $21.64.  So our first rule is if  LSTAT >7.82 and  LSTAT<=14.895 then MEDV=$21.64.

    If LSTAT <= 7.82%, then we move to RM (Average No. of rooms per dwelling) as the next divider.  If RM >7.445 (9 cases), MEDV for those cases is predicted to be $43.11. ($43.11 is another terminal node).  So our second rule is "If LSTAT <= 7.82 AND RM >7.445, then MEDV = $43.11."

    The output also contains summary reports on both the training data and the validation data.  These reports contain the total sum of squared errors, the root mean square error (RMS error, or the square root of the mean squared error), and also the average error (which is much smaller, since errors fall roughly into negative and positive errors and tend to cancel each other out unless squared first.) 

    RT_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details

    Lift charts : Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

    Method of drawing : After the model is built using the training data set, the model is used to score on the training data set and the validation data set (if exists). Then the data set(s) are sorted using the predicted output variable value (or predicted probability of success in the logistic regression case). After sorting, the actual outcome values of the output variable is cumulated and the lift curve is drawn as number of cases versus the cumulated value. The baseline is drawn as number of cases versus the average of actual output variable values multiplied by the number of cases. The decilewise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.

    See also