
 

Logistic Regression

 

Example:

Data Size: Different versions of XLMiner™  have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

  1. Open the file Charles_BookClub.xls. This file contains information about individuals who are members of a book club (M= money spent, R= months since last purchase, F=  number of purchases, FirstPurch = months since first purchase, ChildBks = number of children's books purchased, etc.). We will try to develop a model for predicting whether a person will purchase a book about Florence.

  2. In XLMiner™, select Partition data --> Standard Partition and move all the variables into the "Variables in the partitioned data" box. Select Specify percentages, enter 70% for training data and 30% for validation data (this example does not use test data), and click Finish.

  3. In XLMiner™, select Classification -->  Logistic regression  

    Data Partition1 is selected by default; keep this selection. In the Input data section scroll down to the bottom and select Florence as the output variable. Since Seq# and ID# are unique identifiers and have no meaning in classification we will not use them; select all other variables and move them to the Input variables box. Click on Next to proceed to the next dialog box. 

    Keep the default options for the "success" class and the initial cutoff probability value.

     

  4. The second dialog box of logistic regression contains several options for the procedure.

    Force constant term to zero:  Selecting this causes the constant term to be omitted from the regression.

    Set confidence level for odds:  Use this to alter the level of confidence for the confidence intervals displayed in the results for the odds ratio.

    Advanced Computational Settings

    Click "Advanced" and you see the following dialog.

    Maximum number of iterations:  Estimating the coefficients in logistic regression requires an iterative non-linear maximization procedure. You can specify a maximum number of iterations to prevent the program from getting lost in very lengthy iterative loops. The default is set at 50. Keep it as it is.

    Initial Marquardt overshoot factor:  This overshoot factor is a part of the iterative non-linear maximization procedure. Reducing it speeds the operation by reducing the number of iterations required, but increases the chances that the maximization procedure will fail due to overshoot. Set this factor to 1.

    Collinearity Diagnostics:  Sometimes variables are highly correlated with one another, and this can result in large standard errors for the affected coefficients. This diagnostic display provides information useful in dealing with this problem. Check this option and set the number of collinearity components to 2.

    Click OK to go back to step 2 of 3. Select "Best Subset".

    Best Subset:  Often, a subset of variables (instead of all variables) does the best job of classification. Selecting Best Subset in the above dialog box brings up the Best Subset dialog box:


    Maximum size of best subset: Use the spin buttons to set the maximum size of the best subset to 15. (The best subset produced by XLMiner™ could be smaller.)

    Number of best subsets: Keep the default of 15. You can select any number up to 20.

    Selection Procedures 

    • Backward elimination:  Variables are eliminated one at a time, starting with the least significant.
    • Forward selection:  Variables are added one at a time, starting with the most significant.
    • Exhaustive search:  Searches all combinations of variables for the best fit (can be quite time-consuming, depending on the number of variables). 

    • Sequential replacement:  For a given number of variables, variables are sequentially replaced and replacements that improve performance are retained.

    • Stepwise selection:  Like forward selection, but at each stage, variables can be dropped or added.  

    FIN, FOUT:  In step-wise selection, in adding and eliminating variables, an F-like statistic is calculated for the regression. For a variable to come into the regression, the F-like value must be greater than FIN (the default is 3.84). For a variable to leave the regression, the F-like value must be less than FOUT (the default is 2.71). The value you set for FIN must be greater than the value you set for FOUT.

     For this exercise we will stay with the default options. Click on OK, then click on Next in the prior dialog box, and proceed to the third dialog box. 

     

  5. Check the appropriate options in the third step.

    Covariance matrix of coefficients:  This option causes the coefficient covariance matrix to be displayed with the output. Entries in the matrix are the covariances between the indicated coefficients.  The "on-diagonal" values are the estimated variances of the corresponding coefficients.

    Residuals:  Produces a two-column array of fitted values and residuals.

    Score training data:  Select this option to show an assessment of performance in classifying the training data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score validation data:  Select this option to show an assessment of performance in classifying the validation data. The report is displayed according to your specifications - Detailed, Summary and Lift charts.

    Score Test Data:  The options in this group let you apply the model for scoring to the test partition (if one was created earlier). The option "Score Test Data" is available only if the dataset contains a test partition. Select it to apply the model to the test data.

    Score new Data:  The options in this group let you apply the model for scoring to altogether new data. Specify where the new data is located. See the Example of Discriminant Analysis for detailed instructions on this.

    Score New data in database : See the Example of Discriminant Analysis for detailed instructions on this. 

  6. Click Finish, and the logistic regression output is displayed on a new sheet. Use the output navigator to see the various sections of the output.

  7. A number of sections of output are available, including Classification of the Training Data below. Note that XLMiner™ has not, strictly speaking, classified the data -- it has assigned a "predicted probability of success" to each case. This is the predicted probability, based on the input (independent) variable values for a case, that the output (dependent) variable for the case will be a "1". Since the logistic regression procedure works not with the actual values of the output variable but with the log of the odds, the log odds value is also shown in the output (the predicted probability of success is derived from it).

    To classify each record as a "1" or a "0," we simply assign a "1" to the record if the predicted probability of success exceeds a certain cutoff value, and a "0" otherwise. Here the initial cutoff probability is set to 0.5. This could be handled in Excel with another column containing an IF statement.
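The cutoff rule above can be sketched outside Excel as well. The following is a minimal Python illustration, not XLMiner's own code: the function names and the sample log odds values are hypothetical, and the conversion from log odds to probability is the standard logistic function.

```python
import math

def probability_from_log_odds(log_odds: float) -> float:
    """Convert a log odds value into a predicted probability of success."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(prob_success: float, cutoff: float = 0.5) -> int:
    """Return 1 if the predicted probability exceeds the cutoff, else 0."""
    return 1 if prob_success > cutoff else 0

# Hypothetical log odds values for three records (not from Charles_BookClub.xls).
log_odds_values = [-2.0, 0.0, 1.5]
probs = [probability_from_log_odds(z) for z in log_odds_values]
labels = [classify(p) for p in probs]

for z, p, label in zip(log_odds_values, probs, labels):
    print(f"log odds = {z:+.1f}  probability = {p:.3f}  class = {label}")
```

A probability exactly equal to the cutoff is assigned "0" here, because the rule says the probability must exceed the cutoff. In Excel, the equivalent would be a formula such as =IF(B2>0.5,1,0), where B2 is a hypothetical cell holding the predicted probability.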

    If we selected Best Subset, XLMiner™ makes available the following output, which shows the variables that are included in the subsets. Since we selected 15 as the maximum size of the subset, we are shown the best subset of 1 variable (plus the constant), up to the best subset of 15 variables (plus the constant). It is a list of the different models XLMiner™ generated using the Best Subset selection it has performed.

    Choose Subset :

    See the Best Subset output above. Every model includes a constant term (since we have not selected "Force constant term to zero" in Step 2 of 3) and one or more variables as the other coefficients. We can use any of these models for further analysis by clicking on the respective "Choose Subset" link. The choice of model depends on the calculated values of the various error measures and the probability. RSS is the residual sum of squares, or the sum of squared deviations between the predicted probability of success and the actual value (1 or 0). Cp is "Mallows Cp" and is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum. "Probability" is a quasi hypothesis test of the proposition that a given subset is acceptable; if Probability < 0.05 we can rule out that subset.

    XLMiner™ lists the variables present in the particular subset on which we place the grabber hand. Moving the grabber hand further down the Best Subset output sheet shown above, we see the following.

    The considerations about RSS, Cp and Probability would lead us to believe that the subsets with 10 or 11 coefficients are the best models in this example. If we click the "Choose Subset" link for one of these models, the Logistic Regression - Step 1 of 3 dialog opens, with the input variables already set to the ones included as coefficients in that Best Subset model.

    Model terms are shown in the Regression Model output;  it contains the coefficient, the standard error of the coefficient, the p-value and the odds ratio for each variable (which is simply $e^x$ where x is the value of the coefficient) and confidence interval for the odds. Summary statistics to the right show the residual degrees of freedom (#observations - #predictors), a standard deviation type measure for the model (which typically has a chi-square distribution), the percentage of successes (1's) in the training data, the number of iterations required to fit the model, and the Multiple R-squared value.
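As a rough illustration of how an odds ratio and its confidence interval relate to a coefficient, the sketch below exponentiates the coefficient and a normal-approximation (Wald) interval around it. The Wald interval is an assumption made for this example; the text does not state which method XLMiner™ uses. The function name and the sample coefficient and standard error are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_with_ci(coefficient, std_error, confidence=0.95):
    """Return (odds ratio, lower bound, upper bound).

    Assumes a Wald-type interval: coefficient +/- z * std_error on the
    log odds scale, then exponentiated onto the odds scale.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return math.exp(coefficient), lower, upper

# Hypothetical coefficient and standard error, for illustration only.
odds, low, high = odds_ratio_with_ci(coefficient=0.4, std_error=0.1)
print(f"odds ratio = {odds:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

An odds ratio above 1 means that a one-unit increase in the variable raises the odds of success; an interval that includes 1 suggests the effect may not be distinguishable from zero at the chosen confidence level.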

    The multiple R-squared value shown here is the R-squared value for a logistic regression model, defined as

    $R^2 = (D_0 - D) / D_0$

    where $D$ is the deviance based on the fitted model and $D_0$ is the deviance based on the null model. The null model is defined as the model containing no predictor variables apart from the constant.
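The definition above can be checked by hand with a short sketch. It assumes the usual binomial deviance (minus twice the log-likelihood) for 0/1 outcomes and uses the overall success rate as the null model's prediction, which is what a constant-only logistic model fits. The function names and sample values are hypothetical.

```python
import math

def deviance(actual, predicted):
    """Binomial deviance for 0/1 outcomes: -2 times the log-likelihood."""
    log_likelihood = 0.0
    for y, p in zip(actual, predicted):
        log_likelihood += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * log_likelihood

def multiple_r_squared(actual, predicted):
    """Compute (D0 - D) / D0 with D0 taken from the constant-only model."""
    base_rate = sum(actual) / len(actual)  # null model predicts the success rate
    d0 = deviance(actual, [base_rate] * len(actual))
    d = deviance(actual, predicted)
    return (d0 - d) / d0

# Hypothetical outcomes and fitted probabilities, for illustration only.
actual = [1, 0, 1, 0]
fitted = [0.8, 0.2, 0.7, 0.3]
r2 = multiple_r_squared(actual, fitted)
print(f"Multiple R-squared = {r2:.3f}")
```

A model that predicts no better than the overall success rate gives a value of 0, and the value approaches 1 as the fitted probabilities approach the actual outcomes.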

Collinearity Diagnostics help assess whether two or more variables track one another so closely that they provide essentially the same information. The columns represent the variance components (related to principal components in multivariate analysis), while the rows represent the variance proportion decomposition explained by each variable in the model. The eigenvalues are those associated with the singular value decomposition of the variance-covariance matrix of the coefficients, while the condition numbers are the ratios of the square root of the largest eigenvalue to the square root of each of the others. In general, multicollinearity is likely to be a problem when a component has a high condition number (more than 20 or 30) and high variance decomposition proportions (say, more than 0.5) for two or more variables.
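The rule of thumb in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions, not XLMiner's implementation: the eigenvalues and variance proportions are taken as given (here, made-up numbers), the table is laid out with one row per variable and one column per component as described above, and the thresholds of 30 and 0.5 are the ones suggested in the text.

```python
import math

def condition_numbers(eigenvalues):
    """Square root of the largest eigenvalue divided by that of each eigenvalue."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]

def flagged_components(eigenvalues, proportions, cond_limit=30.0, prop_limit=0.5):
    """Return (component index, variable indices) for suspect components.

    proportions[i][j] is the variance proportion of variable i on component j.
    A component is flagged when its condition number exceeds cond_limit and
    at least two variables have proportions above prop_limit on it.
    """
    flagged = []
    for j, cond in enumerate(condition_numbers(eigenvalues)):
        heavy = [i for i, row in enumerate(proportions) if row[j] > prop_limit]
        if cond > cond_limit and len(heavy) >= 2:
            flagged.append((j, heavy))
    return flagged

# Hypothetical diagnostics for three variables and three components.
eigenvalues = [4.0, 1.0, 0.0016]
proportions = [
    [0.90, 0.08, 0.02],
    [0.03, 0.07, 0.90],
    [0.02, 0.08, 0.90],
]
conds = condition_numbers(eigenvalues)
flagged = flagged_components(eigenvalues, proportions)
print(conds, flagged)
```

In this made-up case the third component has a condition number of about 50, and the second and third variables both load heavily on it, so those two variables would be candidates for closer inspection.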

LR_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

Lift charts : Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

Method of drawing : After the model is built using the training data set, it is used to score the training data set and the validation data set (if one exists). The data sets are then sorted by the predicted output variable value (or the predicted probability of success in the logistic regression case). After sorting, the actual values of the output variable are cumulated, and the lift curve is drawn as the number of cases versus the cumulated value. The baseline is drawn as the number of cases versus the average of the actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.
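The lift curve and baseline described above can be computed with a few lines of code. The sketch below is illustrative only: it assumes the records are sorted from highest to lowest predicted probability (the text does not state the sort direction), and the function names and sample data are hypothetical.

```python
def lift_curve(actual, predicted_prob):
    """Cumulated actual outcomes after sorting by predicted probability.

    Assumes a descending sort, so the most likely successes come first.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted_prob[i], reverse=True)
    cumulative, running_total = [], 0
    for i in order:
        running_total += actual[i]
        cumulative.append(running_total)
    return cumulative

def baseline(actual):
    """Average actual outcome multiplied by the number of cases so far."""
    average = sum(actual) / len(actual)
    return [average * (k + 1) for k in range(len(actual))]

# Hypothetical outcomes and predicted probabilities for six records.
actual = [0, 1, 1, 0, 1, 0]
predicted_prob = [0.2, 0.9, 0.8, 0.4, 0.3, 0.1]

curve = lift_curve(actual, predicted_prob)
base = baseline(actual)
for n, (c, b) in enumerate(zip(curve, base), start=1):
    print(f"cases = {n}  lift curve = {c}  baseline = {b:.1f}")
```

Both series end at the same point (the total number of successes); the model is doing useful work to the extent that the lift curve rises above the baseline before that point.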
