| Contents |
Standard Data Partition
| Data
Partitioning
Most data mining projects use large volumes of data. Before building a model, typically you partition the data using a partition utility. Partitioning yields mutually exclusive datasets: a training dataset, a validation dataset and a test dataset. Training Set The training dataset is used to train or build a model. For example, in a linear regression, the training dataset is used to fit the linear regression model, i.e. to compute the regression coefficients. In a neural network model, the training dataset is used to obtain the network weights. Validation Set Once a model is built on training data, you need to find out the accuracy of the model on unseen data. For this, the model should be used on a dataset that was not used in the training process -- a dataset where you know the actual value of the target variable. The discrepancy between the actual value and the predicted value of the target variable is the error in prediction. Some form of average error (MSE of average % error) measures the overall accuracy of the model. If you were to use the training data itself to compute the accuracy of the model fit, you would get an overly optimistic estimate of the accuracy of the model. This is because the training or model fitting process ensures that the accuracy of the model for the training data is as high as possible -- the model is specifically suited to the training data. To get a more realistic estimate of how the model would perform with unseen data, you need to set aside a part of the original data and not use it in the training process. This dataset is known as the validation dataset. After fitting the model on the training dataset, you should test its performance on the validation dataset. Test Set The validation dataset is often used to fine-tune models. For example, you might try out neural network models with various architectures and test the accuracy of each on the validation dataset to choose among the competing architectures. In such a case, when a model is finally chosen, its accuracy with the validation dataset is still an optimistic estimate of how it would perform with unseen data. This is because the final model has come out as the winner among the competing models based on the fact that its accuracy with the validation dataset is highest. Thus, you need to set aside yet another portion of data which is used neither in training nor in validation. This set is known as the test dataset. The accuracy of the model on the test data gives a realistic estimate of the performance of the model on completely unseen data. Standard Partition XLMiner™ provides two ways of standard partitioning: random partitioning and user-defined partitioning.
Random partitioning uses the system clock as a default to initialize the random number seed. Alternatively, you can specify any non-negative seed for random number generation for the partitioning. If you check the "set seed" option, the default seed entry is 12345. Using the same seed allows you to replicate the partitions exactly for different runs. Let us see how the partition utility works on the dataset Wine.xls. Data Size: Different versions of XLMiner™ have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.
The figure below shows the partition dialog box using random sampling. The "set seed" box is checked, the default user-specified value is left at '12345,' training data comprises 60% of the total dataset, and validation data 40%. These are the defaults for partitioning. To use other percentages, click on "Specify-percentages" and enter the desired percentages for partitioning
The dialog box indicates the name of the dataset, the worksheet that contains the dataset ,#rows and the #columns in it. Data Range First Row Contains Headers Variables Variables in the Partitioned Data This list box contains the names of the variables that you selected from the Variables list. Partition Options User can pick up rows randomly (as shown here). Check "Set Seed" and enter the desired number. Partitioning is also possible by using a partition variable, as shown later. Partitioning percentages when picking up rows randomly XLMiner™ provides three options here -- Automatic : If you select this, 60% of the total number of records in the dataset are assigned randomly to the training set and the rest to the validation set. If the data set is large then 60% will perhaps exceed the limit on number of records in the training partition. (Check the data handling specifications.) In that case, XLMiner™ will allocate such a maximum percentage to the training set that will be just within the limits. It will then assign the remaining percentage to the validation set. eg. If the limit on training partition is 10,000 records and the data set has 60,000 records then XLMiner™ will assign 16.66% records to training set. The remaining 83.34% records will be included in the validation set. Specify percentages : You can specify the required partition percentages here. In case of large data sets XLMiner™ will suggest the maximum possible percentage to training set, so that the training partition is within the specified limits. It will then allocate the remaining records to validation and test sets in the proportion 60:40. You can change these and specify percentages. XLMiner™ will execute your specifications as long as the limits are met. Equal records in training, validation and test set : XLMiner™ will divide the records equally in training, validation and test sets. If the data set is large, it will assign maximum possible records to training so that the number is within the specified limit for training partition and assigns the same percentage to validation and test sets. This means, all the records may not be accommodated. So in case of large data sets use specify percentages if required. The output of a data partition run is shown in the figure below:
XLMiner™ allots Row Ids to records in the sequence they appear in the dataset. It also sorts the output on Row Id before displaying it. Figure below shows the dataset Wine_Partition.xls where we have added "Partition Variable" column.
The figure below shows the dialog box where "Partition Variable" has been selected for partitioning of the dataset.
The output of a data partition run, using a partition variable, is shown in the figure below:
|