Regression Tree
|
Example: Data Size: Different versions of XLMiner™ have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details. We will use the Boston Housing data set to predict the median house price in housing tracts in the Boston area.
Data Range : Specify the range of input data used for partitioning here. XLMinerTM automatically picks the active data range, or you can select by either typing or selecting the desired range. Variables: This box lists all the variables present in the data set. If "First row contains headers" is checked, the header row above the data is used to find variable names. Partitioning options: Leave the "Pick up rows randomly" box checked (if you have a variable in the data set designating which partition a row is assigned to, you could check the "Use partition variable" option). Change the seed to 111. Partitioning percentages...: Select Specify percentages. Set the Training Set to 50%, Validation Set to 30% and the Test Set at 20%. Click on "OK" and the Data Partition sheet is displayed to show the results of partitioning. Note the links at the top to the two partitioned data sets:
The Prune log shows the residual sum of squares (RSS) at each stage of the tree for both training and validation data sets. This is the sum of squared residuals (difference between predicted and actual). You can see in the prune log, that the validation RSS goes on reducing as the tree goes on splitting. The cost complexity is calculated at each step. Cost Complexity Factor is the parameter that governs how far back the tree should be pruned. XLMiner™ decides the #decision nodes for the pruned tree and minimum error tree from the values of RSS and cost complexity. In the Prune log shown below, Validation RSS goes on reducing till the #Decision Nodes increases form 0 to 5, then starts increasing. So the Minimum Error and Best Pruned Tree show 5 decision nodes each, as shown below.
The real test is how well this tree does (i.e. how well these rules do) with the validation data, which were not used in training the model. Here are the predicted validation data values:
We will look at the pruned tree below to understand how to read it. "Pruning" involves chopping off nodes from the bottom up so that there are fewer and fewer branches on the tree. This option causes the regression tree to be pruned thereby reducing error from over-fitting a complex tree to the idiosyncrasies of the validation data. We can read this tree as follows. LSTAT (% of the population that is lower status) is chosen as the first splitting variable; if this percentage is > 7.82% (106 cases), then LSTAT is again chosen for splitting. Now, if LSTAT <=14.895 (51 cases) then MEDV is predicted as $21.64. So our first rule is if LSTAT >7.82 and LSTAT<=14.895 then MEDV=$21.64. If LSTAT <= 7.82%, then we move to RM (Average No. of rooms per dwelling) as the next divider. If RM >7.445 (9 cases), MEDV for those cases is predicted to be $43.11. ($43.11 is another terminal node). So our second rule is "If LSTAT <= 7.82 AND RM >7.445, then MEDV = $43.11." The output also contains summary reports on both the training data and the validation data. These reports contain the total sum of squared errors, the root mean square error (RMS error, or the square root of the mean squared error), and also the average error (which is much smaller, since errors fall roughly into negative and positive errors and tend to cancel each other out unless squared first.) RT_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details Lift charts : Method of drawing :
See also |