Contents

Sampling from Worksheet

Using the Data Sampling Utility

Let us find out how the Data sampling From Worksheet utility works :

Select XLMiner --> Data Utilities --> Sample from worksheet. The following dialog appears on screen.

Data Range : Either type the address directly in this box, or using the reference button, mark the required data range from the worksheet. If the cell pointer (active cell) is already somewhere in the data range, XLMiner™ automatically picks up the contiguous data range surrounding the active cell. When the data range is selected XLMiner™ displays the number of records in the selected range.

First row contains headers : When this box is checked, XLMiner™ picks up the headers from first row of the selected data range. When the box is unchecked, XLMiner™ follows the default naming convention, i.e., the variable in the first column of the selected range will be called "Var1", the second column "Var2," etc.

Variables : This list box contains the names of the variables in the selected data range. If the first row of the range contains the variable names, then these names appear in this list box. If the first row of the dataset does not contain the headers, then XLMiner™ puts the variable names using its default naming convention. In this case the first column is named Var1, the second column is named Var2 and so on.   To select a variable for sampling, click on it, then click on the ">" button.  Use the ctrl key to select multiple variables.

Sampling Options : 

Sample With replacement : If checked the data is sampled with replacement. The default is sampling without replacement.

Set Seed : The user can enter the sorting seed. The default seed is 12345.

Desired sample size : The user can enter here the sample size as desired. (Note that the actual sample size may vary a little, depending on the further instructions from the user. We will know why as we go ahead.)

Simple random sampling : The data is sorted using the simple random sampling techniques taking into account the other parameters entered earlier.

Stratified random sampling : If this sampling method is to be used XLMiner™ provides the following options.

Stratum Variable : From the list of variables, the user can decide which to use for stratified random sampling. Note that XLMiner™ allows only those variables which have less than 30 distinct values. As the user selects the variable name XLMiner™ displays the no. of strata that variable contains in a box in front and smallest stratum size in a box in front of the option "Equal from each stratum, #records = smallest stratum size".

Pick records from strata :

Proportionate to stratum size : XLMiner™ detects the proportion of each stratum in the dataset and maintains the same in sampling. Due to this XLMiner™ sometimes has to change the desired sample size (on the higher side) in order to maintain the proportionate stratum size. XLMiner™ then displays the actual sample size in the output along with the desired sample size as shown below.

Equal from each stratum : On specifying the no. of records, XLMiner™ generates sample which has that same number of records from each stratum. In this case the number chosen automatically decides the desired sample size, so the option to enter the desired sample size is not available.

Equal from each stratum, #records = smallest stratum size : XLMiner™ detects the smallest stratum size and generates a sample wherein every stratum has a representation of that size. Again here option for the desired sample size is not required.

XLMiner™ performs the stratified random sampling with or without replacement. If you choose sample without replacement,  you can not have the desired sample size greater than the number of records in the dataset, for obvious reasons. While entering the #records in the option "equal from each stratum" do not enter a number which will make the sample size greater than the size of the dataset.

If you choose to sample with replacement, then the no. of records in the dataset does not pose any limit on the no. of records in the sample. XLMiner™ can generate a sample having upto 2000 records. 

Before sampling, XLMiner™ generates a Row ID for each record (irrespective of weather the dataset already has one).  XLMiner™ sorts the sampled output according to Row ID before displaying it.

See also: