Contents

Multiple Linear Regression

Example:

Data Size: Different versions of XLMiner™  have varying limits on size of data. The size of data depicted in the example below may not be supported by your version. Refer to Data Handling Specifications for details.

We will use the Boston Housing data set to predict the median house price in housing tracts in the Boston area.  This data set has 14 variables and a description of each variable is given in the table below. In addition to these variables, the data set also contains an additional variable, which has been created by categorizing median value (MEDV) into two categories -- high and low.

 

CRIM

Per capita crime rate by town

ZN      

Proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS

Proportion of non-retail business acres per town

CHAS

Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX

Nitric oxides concentration (parts per 10 million)

RM

Average number of rooms per dwelling

AGE

Proportion of owner-occupied units built prior to 1940

DIS

Weighted distances to five Boston employment centers

RAD

Index of accessibility to radial highways

TAX     

Full-value property-tax rate per $10,000

PTRATIO

Pupil-teacher ratio by town

B

1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT   

% Lower status of the population

MEDV    

Median value of owner-occupied homes in $1000's

  1. Open Boston_Housing.xls from the datasets folder. The figure below shows the data.

  2. Click on the XLMiner™ menu and select XLMiner™--> Partition Data --> Standard Partition. Use the following settings.

    Partition data output is shown in the figure below

  3. From the XLMiner™ menu, select Prediction --> Multiple Linear Regression.  The Multiple Linear Regression dialog box appears: 

    Data Partition1 is selected by default; keep this selection. In the Input data section scroll down to the bottom and select MEDV as the output variable. Select all other variables and move them to the Input variables box. Click on Next to proceed to the next dialog box. 

    Variables: This box displays all the available variables for processing. For Multiple Linear Regression, this version of XLMiner™ requires all independent variables to be numeric, and they are treated as continuous variables. Any variables (data columns) containing any non-numeric value will result in XLMiner™ ignoring that complete column or variable. For this example, select MEDV as the output variable and move it via the ">" to the "Output Variables" box. Move all other variables except CATMDEV to the "Input Variable" box.

    Weight: If the data set were such that it often had multiple records (cases) with the same values on all variables, it can be expressed more compactly by using another variable to record how many cases share a common set of values on the other variables.  Since that is not the case in this example, a weight variable is not used here.  

    In a situation where the the number of rows in the data are less than the number of variables selected,  XLMiner™ displays the following message:

    -

    On selecting Yes, XLMiner™ leads us directly to the Best Subset dialog (See below in 7.)

  4. The next dialog box of multiple linear regression is shown below and various options are explained. For this exercise select all outputs to be displayed as shown in the figure below:

    Force constant term to zero : If this option is checked, there will be no constant term in the equation. We leave the option unchecked for this example.

    ANOVA Table: When this checkbox is checked, the ANOVA table is displayed in the output.

    Fitted values: When this checkbox is checked, the fitted values are displayed in the output.

    Variance-Covariance matrix: When this checkbox is checked the variance-covariance matrix of the estimated regression coefficients is displayed in the output.

    Score on validation data:  Check this box to show the scores that result from applying the model to the validation data.

    Score Test/New Data:  Select the appropriate option for applying the model to test data, new data, or neither. See the Example of Discriminant Analysis for detailed instructions on this. 

    Score New data in database : See the Example of Discriminant Analysis for detailed instructions on this. 

  5. Display of Residuals

     

    Unstandardized: When this checkbox is checked the Unstandardized Residuals are displayed in the output. Unstandardized residuals are computed by the formula 

    Unstandardized residual = Actual response - Predicted response

    Standardized: When this checkbox is checked the Standardized Residuals are displayed in the output. Standardized residuals are obtained by dividing the unstandardized residuals by the respective standard deviations.

  6. Advanced: 

    Clicking on the Advanced tab in the above dialog the following dialog appears:

    Studentized: When this checkbox is checked the Studentized Residuals are displayed in the output. Studentized residuals are computed by dividing the unstandardized residuals by quantities related to the diagonal elements of the hat matrix , using a common scale estimate computed without the ith case in the model. These residuals have t - distributions with ( n-k-1) degrees of freedom, so any residual with absolute value exceeding 3 usually requires attention.

    Deleted: When this checkbox is checked the Deleted Residuals are displayed in the output. The residual for the ith observation is obtained by fitting the model with the ith observation omitted, using the model to predict the ith observation and then computing the difference from the actual ith observation.

    Cook's Distance: When this checkbox is checked the Cook's Distance for each observation is displayed in the output. This is an overall measure of the impact of the ith datapoint on the estimated regression coefficient. In linear models Cook's Distance has, approximately, an F distribution with k and (n-k) degrees of freedom.

    DF fits: When this checkbox is checked the DF fits (change in the regression fit) for each observation is displayed in the output. These reflect coefficient changes as well as forecasting effects when an observation is deleted.

    Covariance Ratios: When this checkbox is checked the covariance ratios are displayed in the output. This measure reflects the change in the variance-covariance matrix of the estimated coefficients when the ith observation is deleted.

    Hat matrix Diagonal: When this checkbox is checked the diagonal elements of the hat matrix are displayed in the output. This measure is also known as the leverage of the ith observation.

    Collinearity Diagnostics: When this checkbox is checked the collinearity diagnostics are displayed in the output.

    Number of collinearity components: Enter the number of collinearity components. This number can be between 2 and the number of degrees of freedom for the model. When the model is fitted without an intercept, the model degrees of freedom is equal to the number of predictors in the model. When the model is fitted with an intercept, the model degrees of freedom is equal to the number of predictors in the model plus one.

    Multicollinearity Criterion: Enter a value between 0 and 1.

  7. Best Subset:

    When you have a large number of predictors and would like to limit the model to those that matter the most, use this option to select the best subset of predictor variables. In this example, we will ask XLMiner™ for a model upto 13 variables, and we will have XLMiner™ show us the 3 best such subsets of variables.

    Check "Perform best subset selection". XLMiner™ performs and displays the Best subset selection.

    Maximum size of Best subsets: Specify here the maximum size of the best subset.  (The best subset produced by XLMiner™ could be smaller.)

    Number of best subsets: Specify here the number of subsets to be shown.  XLMiner™ will first show the best, then the next-best, etc., and will show this number of subsets for subsets of one variable, subsets of two variables, etc., on up to subsets of the size you specified above. The user can choose this number upto 20.

    Selection Procedure

    • Backward elimination:  Variables are eliminated one at a time, starting with the least significant. We select this procedure for this example.
    • Forward selection:  Variables are added one at a time, starting with the most significant.
    • Exhaustive search:  Searches all combinations of variables for the best fit (can be quite time-consuming, depending on the number of variables). 

    • Sequential replacement:  For a given number of variables, variables are sequentially replaced and replacements that improve performance are retained.

    • Stepwise selection:  Like forward selection, but at each stage, variables can be dropped or added.  

    FIN, FOUT: FIN and FOUT become relevant only when the step-wise selection procedure is selected. In adding and eliminating variables, an F-like statistic is calculated for regression.  For a variable to come into the regression, the F-like value must be greater than FIN (the default is 3.84).  For a variable to leave the regression, the F-like value must be less than FOUT (the default is 2.71).  The value you set for FIN must be greater than the value you set for FOUT. 

    Note:  If the constant term is forced to zero then all models reported by the best subset procedure will omit the constant term.

  8. Click on Finish, and the output of multiple linear regression is presented on a new Excel sheet. 

    In addition to standard regression output, XLMiner™ shows the best subsets -- one row for each model, showing the variables included in that model.

    Of primary interest in a data-mining context will be the predicted and actual values for each record, along with the residual (difference), shown here for the training data set:

    XLMiner™ also displays Total sum of squared errors summaries for the training and validation data sets. The total sum of squared errors is the sum of the squared errors (deviations between predicted and actual values) and the root mean square error (square root of the average squared error).  Average error is typically very small, because positive prediction errors tend to be counterbalanced by negative ones.

    A variety of residual and collinearity diagnostics output is available if it was selected earlier in the setting; this is useful for advanced users and is explained above.

    Choose Subset :

    XLMiner™ displays a list of different models it generated using the Best Subset selection it has performed. Since we have selected 13 as the maximum size of subsets, it displays a list of Best Subset models created using 1 to 13 variables.

    See the Best Subset output above. Every model includes a constant term (since we have not selected "Force constant term to zero" in Step 2 of 2) and one or more variables as the other coefficients. We can use any of these models for further analysis by clicking on the respective link "Choose Subset". The choice of model depends on the calculated values of various error values and the probability. The error values calculated are :- RSS --> The residual sum of squares, or the sum of squared deviations between the predicted probability of success and the actual value (1 or 0), Cp --> Mallows Cp (Total squared error) is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum, R-Squared --> R-squared Goodness-of-fit, Adj. R-Squared --> Adjusted R-Squared values. "Probability" is a quasi hypothesis test of the proposition that a given subset is acceptable; if Probability < .05 we can rule out that subset.

    XLMiner™ gives a list of variables present in that particular Choose Subset selection on which we place the grabber hand. When we move the grabber hand way down in the Best Subset output sheet shown above, we see the following.

    The value of RSS reduces to a near minimum as we go from 11 coefficients to 12. Cp also attains the minimum value. So these 12 coefficients (a constant and 11 variables) may be sufficient to fit a regression. Select the link of first occurrence of 12 coefficients. This will enable the dialog - Multiple Linear Regression - Step 1 of 2 with these selected 11 variables as the input variables. We can then go ahead with steps to run Multiple Linear Regression utility.

    MLR_Stored_1 : XLMiner™ generates this sheet along with the other outputs. Please refer to the Stored Model Sheets for details.

    Lift charts : Lift charts are visual aids for measuring model performance. They consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

    Method of drawing : After the model is built using the training data set, the model is used to score on the training data set and the validation data set (if exists). Then the data set(s) are sorted using the predicted output variable value (or predicted probability of success in the logistic regression case). After sorting, the actual outcome values of the output variable is cumulated and the lift curve is drawn as number of cases versus the cumulated value. The baseline is drawn as number of cases versus the average of actual output variable values multiplied by the number of cases. The decilewise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's average output variable value.

     

See also: