Contents

 

Discriminant Analysis

 

Example:

Data Size: Different versions of XLMiner™  have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

  1. Open file Wine.xls in Microsoft Excel.

  2. In XLMiner™, select Partition data option, move all variables to the "Variables in the partitioned data" box, specify the percentage as shown. 

    Data Range: Specifies the range of input data used for partitioning. XLMiner™ automatically picks active data range.  You can also enter the range address, or select it with the mouse.

    Variables: This box lists all the variables present in the dataset. If the "First row contains headers" box is checked, the header row above the data is used to identify variable names.

    Select all the variables. Click on Finish.  

  3. In XLMiner™ menu click on classification --> Discriminant Analysis to get the first dialog box of discriminant analysis. Select type as output variable and remaining variables as input variables. Click on Next button to proceed. Figure below shows dialog box with selection and explains various options available.

    Data Partition1 is selected by default; keep this selection. In the Input data section scroll down to the bottom and select Type as the output variable. Select all other variables and move them to the Input variables box. Click on Next to proceed to the next dialog box. 

     

  4. In Prior class probabilities use the default option and click on Next. 

     

    Calculate according to relative occurrences:  The discriminant analysis procedure incorporates prior assumptions about how frequently the different classes occur.  If this option is checked, it will be assumed that the probability of encountering a particular class in the large data set  is the same as the frequency with which it occurs in the training data.

    Use equal prior probabilities:  If this option is checked, it will be assumed that all classes occur with equal probability.

  5. In the third dialog box, select the appropriate options of scoring.

     

    Canonical variate loadings: XLMiner™ produces the canonical variates for the data which is based on an orthogonal representation of the original variates. This has the effect of choosing a representation which maximizes the distance between the different groups. For a k class problem there are k-1 Canonical variates. Very often only a subset of the canonical variates is sufficient to discriminate between the classes. For our problem we have two canonical variates. This means if we replace the four original predictors by just two predictors X1 and X2, (which are actually linear combinations of the four original predictors) the discrimination based on these two predictors will be as good as the one based on original predictors. Check the option for canonical variate loadings.

    Canonical Scores: The values of the variables X1 and X2 for the ith observation are known as the canonical scores for that observation. In our example the pair of canonical scores for each observation represents the observation in a two dimensional space. The purpose of the canonical score is to make separation between the classes as large as possible. Thus when the observations are plotted with the canonical scores as the coordinates, the observations belonging to same class are grouped together. Here we are reporting the scores of the first few observations.

    Score training / validation data:  Check appropriate options to show the scores of training and validation data.

    Score Test/New Data:  Select the appropriate option for applying the model to test data and / or new data as required. For new data, See below.

    Score New data in database : See below in this Example of Discriminant Analysis for detailed instructions on this. 

          Click "Finish" to get the output.

 

        6.  The output is displayed in a separate sheet; various sections of the output can be viewed using the Output Navigator. 

 

     

 

Classification of training data:  This section of the output shows how each training data case was classified. The highest probability values in each record are highlighted.

 

     

Classification Functions form the basis for assigning each case to a class.  In this example, there are 3 functions -- one for each class.  Whichever class's function is the highest for a particular case is the class assigned to that case. 

Canonical Variate Loadings are a second set of functions that give a representation of the data that maximizes the separation between the classes.  The number of functions is one less than the number of classes (so in this case there are two functions).  So, if you were to plot the cases in this example on an x-y plot where xi and yi are the ith case's value for variate1 and variate2, you would see a clear separation of the data.  This output is useful in illustrating the inner workings of the discriminant analysis procedure, but is not typically needed by the end-user analyst. 

 

 

Canonical Scores are the values of each case for each of these functions.  Again, these are intermediate values useful for illustration but not required by the end-user analyst.

 

 

DA_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

 

 

Scoring to database 

XLMiner™ provides a facility to score to a database. We match the input variables with the database fields and scoring is performed on the database fields. This facility is not available in the Education version.

Open the dataset wine.xls and perform steps 1 through 4 above. You get the dialog of step 3 of 3 as follows:

 

 

In "Score new data in" check the box for Database. If you click the pull down button for Data source, the dialog appears as follows.

 

 

Select the data source. The Connect to a database button is activated. On selecting  it, you get different dialogs for different data sources.

 

If you have selected the SQL Server you see the following Login Form.

 

 

Enter the appropriate details and click OK.

 

The Login form for Oracle is as follows:

 

 

If you are working with MS-Access database, you see the following. 

 

 

Click on "Browse for database file" and choose your database. Select OK after opening the database.

Scoring to database follows the same steps thereafter for all the data sources. For illustration, a MS-Access file is shown. We have selected a database, dataset.mdb.

 

 

Select Boston_Housing in "Select table/view". The fields in table and variables in the input data are shown above. We now need to match the variables in our dataset to the various fields of our database. 

  1. Select  "Match the first 13 variables in the same sequence". This will match all the 13 variables of our wine.xls with the first 13 variables of the database. Or,

  2. You can do mapping yourself. Select the matching variables -- one from the database and one from wine.xls. 

     

 

You will notice after selecting the variables, the option " Match CRIM <--> Alco" is activated. Select it and these two variables are matched. Continue this process for all the variables required.

If the dataset has variable names which are same as field names in the database then select the option "Match variables with same name."

 

Output Field : 

  1. You can select the output field from the remaining existing fields of the database.

  2. Select  a field from the list, say, Out. Assign it by clicking the sign in front of "Select output field". Now click OK. The output scores will be stored in the field Out. Or,
  3. Select "Add new field for output". Type a name, say, out_new in the space provided and press OK.

 

    Xlminer™  creates out_new in the database . If you open the database you will be able to see out_new created and scores stored there.

    If you click on Database Score in the output navigator XLMiner™ displays the following.

     

         

Score new data

XLMiner™ can also perform scoring on new data.

Open Digits.xls.. Open Flying_Fitness.xls. Start with the steps of Discriminant analysis on Flying_Fitness as described in the example. Do Step 1 of 3 and Step 2 of 3 as mentioned in the example to get step 3 of 3.

Score new data

Let's see how XLMiner™ can perform scoring on new dataset.  

Check the box "Detailed Report" for scoring the new data in it. The dialog for "Match variables in the new page" appears.  Go to the drop down box in front of "Workbook" and select Digits.xls there.

In the above dialog, The variables in the input data are from Flying_Fitness.xls and that from new data are of Digits.xls. We can match the variables in different ways.

  1. Select  Match the first 5 variables in the same sequence. This will match all the 5 variables of our Flying_Fitness.xls with the first 5 variables of  Digits.xls.

  2. You can do mapping yourself. Select the matching variables -- one from the Flying_Fitness.xls and one from Digits.xls 

 

You will notice after selecting the variables, the option " Match x2 <--> Var2" is activated. Select it and these two variables are matched. Continue this process for all the variables required.

If "Match the first 5 variables in the same sequence" is selected,

 

 

You will see that all the variables in the input data are matched with the variables in the new data.

    Click OK after matching all the variables from the input data.  Click finish on step 3 of 3. See the output.

     

     

    See also