Open Boston_Housing.xls from the datasets folder.
The figure below shows the data.
Click on the
XLMiner™
menu and select
XLMiner™--> Partition Data --> Standard Partition. Use the following
settings.

Partition data output is shown in the figure below

From the XLMiner™
menu, select Prediction
--> Multiple Linear Regression. The Multiple Linear Regression dialog
box appears:

Data Partition1 is selected by default; keep this selection. In the Input
data section scroll down to the bottom and select MEDV as the output
variable. Select all other variables and move them to the Input variables box.
Click on Next to proceed to the next dialog box.
Variables:
This box displays all the available variables for processing. For
Multiple Linear Regression, this version of XLMiner™ requires all
independent variables to be numeric, and they are treated as continuous
variables. Any variables (data columns) containing any non-numeric value will
result in XLMiner™ ignoring that complete column or variable. For this example, select
MEDV as the output variable and move it via the ">"
to the "Output Variables" box. Move all other variables except
CATMDEV to the "Input Variable" box.
Weight: If the data set were such that
it often had multiple records (cases) with the same values on all variables,
it can be expressed more compactly by using another variable to record how
many cases share a common set of values on the other variables. Since
that is not the case in this example, a weight variable is not used
here.
In a situation
where the the number of rows in the data are less than the number of variables
selected, XLMiner™ displays the following message:
-
On
selecting Yes, XLMiner™ leads us directly to the Best Subset dialog (See below
in 7.)
The next dialog box of multiple linear regression is shown below and various
options are explained. For this exercise select all outputs to be displayed as
shown in the figure below:

Force
constant term to zero : If this option is checked, there will be no
constant term in the equation. We leave the option unchecked for this
example.
ANOVA Table: When this checkbox
is checked, the ANOVA table is displayed in the output.
Fitted values: When this
checkbox is checked, the fitted values are displayed in the output.
Variance-Covariance matrix: When
this checkbox is checked the variance-covariance matrix of the estimated
regression coefficients is displayed in the output.
Score on validation data: Check this box to show the scores that result from
applying the model to the validation data.
Score Test/New Data: Select the appropriate option for applying
the model to test data, new data, or neither. See
the Example of Discriminant Analysis for detailed instructions on this.
Score New data in database : See
the Example of Discriminant Analysis for detailed instructions on this.
-
Display of Residuals
Unstandardized: When this
checkbox is checked the Unstandardized Residuals are displayed in the output.
Unstandardized residuals are computed by the formula
Unstandardized
residual = Actual response - Predicted response
Standardized: When this checkbox is checked the Standardized Residuals are
displayed in the output. Standardized residuals are obtained by dividing the
unstandardized residuals by the respective standard deviations.
- Advanced:
Clicking on the Advanced tab in the above
dialog the following dialog appears:
Studentized: When this checkbox
is checked the Studentized Residuals are displayed in the output. Studentized
residuals are computed by dividing the unstandardized residuals by quantities
related to the diagonal elements of the hat matrix , using a common scale
estimate computed without the ith case in the model. These residuals
have t - distributions with ( n-k-1) degrees of freedom, so any residual with
absolute value exceeding 3 usually requires attention.
Deleted: When this checkbox is
checked the Deleted Residuals are displayed in the output. The residual for the
ith observation is obtained by
fitting the model with the ith observation omitted, using the model to
predict the ith observation and then computing the difference from
the actual ith observation.
Cook's Distance: When this
checkbox is checked the Cook's Distance for each observation is displayed in the
output. This is an overall measure of the impact of the ith datapoint
on the estimated regression coefficient. In linear models Cook's Distance has,
approximately, an F distribution with k and (n-k) degrees of freedom.
DF fits: When this checkbox is
checked the DF fits (change in the regression fit) for each observation is
displayed in the output. These reflect coefficient changes as well as
forecasting effects when an observation is deleted.
Covariance Ratios: When this
checkbox is checked the covariance ratios are displayed in the output. This
measure reflects the change in the variance-covariance matrix of the estimated
coefficients when the ith observation is deleted.
Hat matrix Diagonal: When this
checkbox is checked the diagonal elements of the hat matrix are displayed in the
output. This measure is also known as the leverage of the ith
observation.
Collinearity Diagnostics: When
this checkbox is checked the collinearity diagnostics are displayed in the
output.
Number of collinearity components: Enter the number of collinearity components. This number can be
between 2 and the number of degrees of freedom for the model.
When the model is fitted without an intercept, the model degrees of freedom is
equal to the number of predictors in the model. When the model is fitted with an
intercept, the model degrees of freedom is equal to the number of predictors in
the model plus one.
Multicollinearity Criterion:
Enter a value between 0 and 1.
Best Subset:
When you have a large number of predictors and
would like to limit the model to those that matter the most, use this option to
select the best subset of predictor variables. In this example, we
will ask XLMiner™ for a model upto 13 variables, and we will have XLMiner™
show us the 3 best such subsets of variables.
Check "Perform best subset
selection".
XLMiner™ performs and displays the Best subset selection.
Maximum
size of Best subsets:
Specify here the maximum size of the best subset. (The best subset
produced by XLMiner™
could be smaller.)
Number of best subsets:
Specify here the number of subsets to be shown.
XLMiner™
will first show the
best, then the next-best, etc., and will show this number of subsets for
subsets of one variable, subsets of two variables, etc., on up to subsets of
the size you specified above. The user can choose this number upto 20.
Selection Procedure:
- Backward elimination: Variables are eliminated one at a time, starting with the least
significant. We select this procedure for this example.
- Forward
selection: Variables are added one at a time, starting with the most
significant.
-
Exhaustive
search: Searches all combinations of variables for the best fit (can be
quite time-consuming, depending on the number of variables).
-
Sequential
replacement: For a given number of variables, variables are
sequentially replaced and replacements that improve performance are
retained.
-
Stepwise
selection: Like forward selection, but at each stage, variables can be
dropped or added.
FIN, FOUT: FIN and FOUT become
relevant only when the step-wise selection procedure is selected. In adding and
eliminating variables, an F-like statistic is calculated for regression. For a variable to come into the regression, the F-like
value must be greater than FIN (the default is 3.84). For a
variable to leave the regression, the F-like value must be less than FOUT
(the default is 2.71). The value you set for FIN must be greater
than the value you set for FOUT.
Note: If the constant term is forced to
zero
then all models reported by the best subset procedure will omit the constant
term.
Click on Finish, and the output of multiple
linear regression is presented on a new Excel sheet.
In addition to standard regression output, XLMiner™ shows the best subsets --
one row for each model, showing the variables included in that model.
Of primary interest in a data-mining context will be the predicted and actual
values for each record, along with the residual (difference), shown here for the
training data set:
XLMiner™ also displays Total sum of squared errors summaries for the training and
validation data sets. The total sum of squared errors is the sum of the
squared errors (deviations between predicted and actual values) and the root
mean square error (square root of the average squared error). Average
error is typically very small, because positive prediction errors tend to be
counterbalanced by negative ones.
A variety of residual and collinearity diagnostics output is available if it
was selected earlier in the setting; this is useful for advanced users and is
explained above.
Choose Subset :
XLMiner™ displays a list of different models it
generated using the Best Subset selection it has performed. Since we have
selected 13 as the maximum size of subsets, it displays a list of Best Subset
models created using 1 to 13 variables.

See the Best Subset output above. Every model includes a constant term (since we
have not selected "Force constant term to zero" in Step 2 of 2) and one or more
variables as the other coefficients. We can use any of these models for further
analysis by clicking on the respective link "Choose Subset". The choice
of model depends on the calculated values of various error values and the
probability. The error values calculated are :- RSS -->
The residual sum of squares, or the sum of squared
deviations between the predicted probability of success and the actual value (1
or 0),
Cp --> Mallows Cp (Total squared error) is a
measure of the error in the best subset model, relative to the error
incorporating all variables. Adequate models are those for which Cp is roughly
equal to the number of parameters in the model (including the constant), and/or
Cp is at a minimum, R-Squared --> R-squared
Goodness-of-fit, Adj. R-Squared --> Adjusted R-Squared values.
"Probability" is a
quasi hypothesis test of the proposition that a given subset is acceptable; if
Probability < .05 we can rule out that subset.
XLMiner™ gives a list of variables present in that particular Choose Subset
selection on which we place the grabber hand. When we move the grabber hand way down in the Best Subset output sheet
shown above, we see the following.

The value of RSS reduces to a near minimum
as we go from 11 coefficients to 12. Cp also attains the minimum value. So these 12 coefficients (a constant and 11
variables) may be sufficient to fit a regression. Select the link of first
occurrence of 12 coefficients. This will enable the dialog - Multiple Linear
Regression - Step 1 of 2 with these selected 11 variables as the input
variables. We can then go ahead with steps to run Multiple Linear Regression
utility.
MLR_Stored_1 : XLMiner™ generates this sheet along with the
other outputs. Please refer to the
Stored Model Sheets for details.
Lift charts :
Lift charts are visual aids for
measuring model performance. They consist of
a lift curve and a baseline. The greater the area between the lift curve
and the baseline, the better the
model.
Method of drawing :
After the model is built using the
training data set, the model is used to score on the training data set and the
validation data set (if exists). Then the data set(s) are sorted using the
predicted output variable value (or predicted probability of success in the
logistic regression case). After sorting, the actual outcome values of the
output variable is cumulated and the lift curve is drawn as number of cases versus the cumulated
value. The baseline is drawn as number of cases versus the average of actual
output variable values multiplied by the number of cases.
The
decilewise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by
the decile's average output variable value.