Contents

Regression Tree

Introduction

As with all regression techniques we assume the existence of a single output (response) variable and one or more input (predictor) variables. The output variable is numerical. The general regression tree building methodology allows input variables to be a mixture of continuous and categorical variables. (All versions of XLMiner™ support continuous numerical variables; check the descriptive information for your version of XLMiner™ to determine whether it supports categorical variables). A decision tree is generated where each decision node in the tree contains a test on some input variable's value. The terminal nodes of the tree contain the  predicted output variable values. 

Regression tree may be considered as a variant of decision trees, designed to approximate real-valued functions instead of being used for classification tasks.

Methodology 

Regression tree is built through a process known as binary recursive partitioning. This is an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches. Initially all of the records in training set (the pre-classified records that are used to determine the structure of the tree) are together in one big box. The algorithm then tries breaking up the data, using every possible binary split on every field. The algorithm chooses the split that partitions the data into two parts such that it minimizes the sum of the squared deviations from the mean in the separate parts.  This splitting or partitioning is then applied to each of the new branches.  The process continues until each node reaches a user-specified minimum node size and becomes a terminal node.  (If the sum of squared deviations from the mean in a node is zero, then that node is considered a terminal node even if it has not reached the minimum size.) 

Pruning the tree 

Since the tree is grown from the training data set, when it has reached full structure it usually suffers from over-fitting (i.e. it is "explaining" random elements of the training data that are not likely to be features of the larger population of data). This results in poor performance on real life data. Therefore, it has to be pruned using the validation data set. XLMiner™ calculates the cost complexity factor at each step during the growing of the tree and decides the #decision nodes for the pruned tree. The cost complexity factor is the multiplicative factor that is applied to the size of the tree (which is measured by the number of terminal nodes). 

The tree is pruned to minimize the sum of (1) the output variable variance in the validation data, taken a terminal node at a time, and (2) the product  of the cost complexity factor and the number of terminal nodes. If the cost complexity factor is specified as zero then pruning is simply finding the tree that performs best on validation data in terms of total terminal node variance. Larger values of the cost complexity factor result in smaller trees. The pruning is done such that the last grown node is chopped off first and so on.

See also