Partition With Oversampling

This method of partitioning is used when the percentage of successes in the output variable is very low in the dataset, but we want to train a model on data with a particular percentage of successes. Oversampling is executed as follows:

  1. XLMiner™ partitions the data by placing a randomly chosen 50% of the success records in the training set. The randomization is done internally. The split is based on the output variable we have selected, which must have exactly two classes. These classes can be numbers or strings.

  2. It then checks the "% success in training set" that you specified and achieves it by randomly adding the required number of failure records to the training set.

  3. The remaining 50% of successes go to the validation set. XLMiner™ randomly selects an appropriate number of failure records so that the success percentage in the validation set matches that of the original data set.

  4. If you specify "% validation data to be taken away as test data", XLMiner™ creates an appropriate test set out of the validation set.
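
The four steps above can be sketched in Python as follows. This is only an illustration of the procedure as described here, not XLMiner™'s actual code: the function name, its parameters, the default seed, and the rounding rules (half of the successes rounded down, validation size truncated to a whole number) are all assumptions made for the sketch.

```python
import random


def partition_with_oversampling(records, is_success, pct_success_train=50.0,
                                pct_validation_to_test=0.0, seed=12345):
    """Hypothetical sketch of steps 1-4; returns (training, validation, test)."""
    rng = random.Random(seed)  # a fixed seed makes the partition repeatable
    successes = [r for r in records if is_success(r)]
    failures = [r for r in records if not is_success(r)]
    rng.shuffle(successes)
    rng.shuffle(failures)

    # Step 1: a random 50% of the successes go to the training set.
    n_train_succ = len(successes) // 2
    train_succ, rest_succ = successes[:n_train_succ], successes[n_train_succ:]

    # Step 2: add enough random failures to reach the requested success
    # percentage in the training set (pct_success_train must be > 0).
    n_train_fail = round(n_train_succ * (100.0 - pct_success_train)
                         / pct_success_train)
    train_fail, rest_fail = failures[:n_train_fail], failures[n_train_fail:]
    training = train_succ + train_fail

    # Step 3: the remaining successes go to the validation set, padded with
    # random failures so its success rate matches the original data set.
    # Integer division truncates the size (an assumption of this sketch).
    n_val_total = len(rest_succ) * len(records) // len(successes)
    n_val_fail = min(n_val_total - len(rest_succ), len(rest_fail))
    validation = rest_succ + rest_fail[:n_val_fail]

    # Step 4: optionally move a percentage of the validation set to a test set.
    rng.shuffle(validation)
    n_test = round(len(validation) * pct_validation_to_test / 100.0)
    test, validation = validation[:n_test], validation[n_test:]
    return training, validation, test


# Usage with made-up data: 10 successes (1) among 1000 records.
data = [1] * 10 + [0] * 990
train, valid, test = partition_with_oversampling(data, lambda r: r == 1)
print(len(train), len(valid), len(test))
```

Shuffling the successes and failures separately keeps the random selection in each step independent of the order of the records in the worksheet.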

For illustration we use the data set Catalog_multi.xls. It contains responses to a direct mail offer, published by DMEF, the Direct Marketing Educational Foundation. In this data set, "Target dependent variable:buyer(yes=1)" is the output variable. The success rate is less than 1%. In some applications we prefer to train on data with a success rate of around 50%, so we use the oversampling utility. Let's see how.

Open Catalog_multi.xls. Invoke XLMiner --> Partition Data --> Partition with Oversampling.  You get the following dialog.


Data Range
The data range to be used should be specified in the Data Range box. Either type the address directly in this box, or using the reference button, mark the required data range from the worksheet. If the cell pointer (active cell) is already somewhere in the data range, XLMiner™ automatically picks up the contiguous data range surrounding the active cell.

First Row Contains Headers
When this box is checked, XLMiner™ picks up the headers from the first row of the selected data range. When the box is unchecked, XLMiner™ follows the default naming convention, i.e., the variable in the first column of the selected range will be called "Var1", the second column "Var2", etc.
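
The naming rule just described can be mimicked in a few lines of Python. The function `variable_names` and its arguments are hypothetical and exist only to illustrate the convention; they are not part of XLMiner™.

```python
def variable_names(first_row, first_row_contains_headers):
    """Return variable names for a data range, given its first row.

    Hypothetical helper: mirrors the "First Row Contains Headers" rule.
    """
    if first_row_contains_headers:
        # Use the cells of the first row as the variable names.
        return [str(cell) for cell in first_row]
    # Otherwise fall back to the default names Var1, Var2, ...
    return [f"Var{i}" for i in range(1, len(first_row) + 1)]


print(variable_names(["age", "income", "buyer"], True))
print(variable_names([34, 52000, 1], False))
```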

Variables : This box lists all the variables present in the dataset.

Variables in the partitioned data: This list box contains the names of the variables that you selected from the Variables list.

Randomization options : Check "Set Seed" and enter the desired number. 

Select all the variables under the variables list and transfer them to Variables in the partitioned data by clicking on the transfer button.

We can now select the output variable. Click on Target dependent variable:buyer(yes=1) in the list. As soon as you select it, the selection button under "Output variable" is activated. Click on it to set the output variable. Remember, the output variable should have only two distinct classes.

The options under "Output options" show the values relevant to the output variable we have chosen.

#Classes : This shows how many classes (distinct values) are present in the output variable.

Specify Success class : You can select which value of the output variable you want to treat as success. XLMiner™ immediately shows the percentage of that value in the data set next to "% of success in data set".
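
To make the two output options above concrete, here is a small Python sketch that computes the number of classes and the percentage of successes for an output column. The function `summarize_output` and the sample column are made up for illustration; only the two-class requirement comes from the text.

```python
from collections import Counter


def summarize_output(column, success_class):
    """Return (#classes, % of success in data set) for an output column."""
    counts = Counter(column)
    if len(counts) != 2:
        # The output variable must have exactly two distinct classes.
        raise ValueError(f"expected 2 classes, found {len(counts)}")
    pct_success = 100.0 * counts[success_class] / len(column)
    return len(counts), pct_success


# Made-up column: 3 successes ("yes") among 100 records.
column = ["yes"] * 3 + ["no"] * 97
print(summarize_output(column, "yes"))
```

The classes can be numbers or strings, so the sketch compares values directly instead of assuming a 0/1 coding.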

Specify % success in training set : You can select what percentage of successes you would like the training set to have. XLMiner™ randomly selects the successes for the training set and then randomly selects as many failures as are needed to reach this percentage.

Once the training set is made, XLMiner™ assigns the remaining successes to the validation set and randomly adds as many failures as are needed to keep the percentage of successes the same as in the original data set.

Specify % validation data to be taken away as test data : Here we specify what percentage of the validation set will be used as the test set, if we want a test set.

Let us make the following selection :

Click OK. We get the following output.

Let us see how XLMiner™ arrived at the record counts that we see in the oversampling sheet. If you take a look at the data set, the output variable contains 576 1's. We have taken 1 as the success class. XLMiner™ randomly places 50% of the 1's (288 records) in the training set. Because we specified 50% successes in the training set, it adds an equal number (288) of randomly selected 0's. So the training set has 576 records.

The oversampling sheet above shows that the percentage of successes in the data set is 0.989605704. XLMiner™ maintains this percentage in the validation set. It allocates the remaining 1's (50% of 576, i.e. 288 in all) to the validation set. It then selects as many 0's as needed so that the success percentage in the validation set is the same as in the original data set. The total number of records in the validation set therefore comes to 29102, since 288 is about 0.989605704% of 29102. On allotting 50% of these to the test set, the validation and test sets have 14551 rows each.
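
The record counts above can be re-derived with a few lines of Python. The inputs are the figures quoted in this example; the variable names, and the choice to truncate the validation size to a whole number, are assumptions of this sketch rather than documented XLMiner™ behavior.

```python
successes_total = 576           # 1's in the output variable
pct_success_data = 0.989605704  # % of successes in the data set
pct_success_train = 50.0        # "% success in training set"
pct_val_to_test = 50.0          # "% validation data to be taken away as test data"

# Training set: half of the successes plus a matching share of failures.
train_successes = successes_total // 2
train_failures = round(train_successes * (100 - pct_success_train)
                       / pct_success_train)
train_total = train_successes + train_failures

# Validation set before the test split: the remaining successes make up
# pct_success_data percent of it (size truncated, as assumed above).
val_successes = successes_total - train_successes
val_pool_total = int(val_successes / (pct_success_data / 100))

# Move the requested share of the validation set to the test set.
test_total = round(val_pool_total * pct_val_to_test / 100)
val_total = val_pool_total - test_total

print(train_total, val_pool_total, val_total, test_total)
# prints: 576 29102 14551 14551
```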