Cheese
Problem
A food company experimented with different levels of salt and fat in a cheese product (measured from a baseline that we will call "0 salt, 0 fat"). There were 13 different trials resulting in 13 observations (5 of which were all at the baseline level). Data on the levels of salt and fat and on resulting consumer acceptance are in file "cheese.dat" and as follows:
Cheese Table. Consumer Acceptance of Cheese by Salt and Fat Levels
| Salt levels | Fat levels | Consumer acceptance |
| -1 | 1 | 4.2 |
| -1 | 1 | 2.8 |
| 1 | 1 | 7.4 |
| 1 | -1 | 6.1 |
| -1.41 | 0 | 3.4 |
| 1.41 | 0 | 6.6 |
| 0 | -1.41 | 4.6 |
| 0 | 1.41 | 7.0 |
| 0 | 0 | 5.2 |
| 0 | 0 | 5.6 |
| 0 | 0 | 5.4 |
| 0 | 0 | 6.0 |
| 0 | 0 | 5.6 |
Product acceptance is regressed on fat, salt, the product of fat * salt, the square of salt, and the square of fat. The output from such a regression is a set of six prediction coefficients describing the following model:
Acceptance = B1*salt + B2*fat + B12*salt*fat + B11*salt*salt + B22*fat*fat + constant
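For readers without Resampling Stats, the model above can be fitted by ordinary least squares in other tools. The following is a minimal sketch in Python with NumPy (an assumption; the original uses Resampling Stats). The data are copied from the Cheese Table as printed, and the helper name `design_matrix` is hypothetical.

```python
import numpy as np

# Data copied from the Cheese Table as printed (13 trials).
salt = np.array([-1, -1, 1, 1, -1.41, 1.41, 0, 0, 0, 0, 0, 0, 0])
fat = np.array([1, 1, 1, -1, 0, 0, -1.41, 1.41, 0, 0, 0, 0, 0])
accept = np.array([4.2, 2.8, 7.4, 6.1, 3.4, 6.6, 4.6, 7.0,
                   5.2, 5.6, 5.4, 6.0, 5.6])


def design_matrix(salt, fat):
    """Columns in model order: salt, fat, salt*fat, salt^2, fat^2, constant."""
    return np.column_stack(
        [salt, fat, salt * fat, salt**2, fat**2, np.ones_like(salt)]
    )


X = design_matrix(salt, fat)

# Ordinary least squares: find the six coefficients minimizing squared error.
coef, *_ = np.linalg.lstsq(X, accept, rcond=None)

for name, value in zip(["B1", "B2", "B12", "B11", "B22", "constant"], coef):
    print(f"{name:>8}: {value: .3f}")
```

The column order matches the order of terms in the model, so the last coefficient is the constant.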
Assess the variability in these regression parameters by resampling the residuals.
Why resample the residuals? Recall that our goal is to know how stable the results are - how much might the variables' coefficients differ were we to repeat the experiment over and over? If we had the time and money, we would actually repeat the experiment and, with the new product acceptance scores, observe how much the coefficients change from one experiment to the next. Clearly, we lack the time and money to do this. We would like to find a source for simulated experiment results that would allow us to repeatedly recalculate the coefficients. At first glance, it might appear reasonable to put the 13 experiment results on cards (three numbers per card, representing salt, fat, and product acceptance), place the cards in a hat, then resample 13 from the hat and recalculate the regression.
However, that regards the experimental combinations of salt and fat themselves as random effects. Our initial levels were very carefully picked - they were not a random selection of levels of salt and fat. To minimize experimental costs, most combinations were run only once, and salt and fat levels were systematically varied in order to produce maximum information. It will help if we conceive of a product acceptance score (PA) as a function of the levels of salt and fat, plus a random component: PA = f(salt, fat, random error).
The salt and fat levels we regard as fixed, but the fluctuation in the random error term is what lends uncertainty to the coefficient estimates. Therefore, we can resample the error term and create new, resampled data sets with which to make new, resampled coefficient estimates. Of course, we do not know the true errors because we cannot be certain of the true model. Uncertainty surrounding the true model is what brought us to this point! However, we can use the residuals from the fitted equation to estimate the error terms.
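The difference between the two schemes can be made concrete with a short sketch, again in Python with NumPy as an assumed stand-in for Resampling Stats. Drawing whole "cards" changes which salt and fat combinations appear in the data set, while resampling residuals leaves the design untouched. The seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Data copied from the Cheese Table as printed.
salt = np.array([-1, -1, 1, 1, -1.41, 1.41, 0, 0, 0, 0, 0, 0, 0])
fat = np.array([1, 1, 1, -1, 0, 0, -1.41, 1.41, 0, 0, 0, 0, 0])
accept = np.array([4.2, 2.8, 7.4, 6.1, 3.4, 6.6, 4.6, 7.0,
                   5.2, 5.6, 5.4, 6.0, 5.6])
X = np.column_stack([salt, fat, salt * fat, salt**2, fat**2, np.ones(13)])

# Case resampling ("cards in a hat"): whole rows are redrawn,
# so the set of salt/fat combinations itself becomes random.
rows = rng.integers(0, 13, size=13)
X_case, y_case = X[rows], accept[rows]

# Residual resampling: the design is held fixed and only the
# estimated error terms are redrawn and added to the fitted values.
coef, *_ = np.linalg.lstsq(X, accept, rcond=None)
fitted = X @ coef
resid = accept - fitted
X_resid = X
y_resid = fitted + rng.choice(resid, size=13, replace=True)

n_center_orig = int(np.sum((salt == 0) & (fat == 0)))
n_center_case = int(np.sum((X_case[:, 0] == 0) & (X_case[:, 1] == 0)))
print("baseline trials in the original design:", n_center_orig)
print("baseline trials after case resampling: ", n_center_case)
```

Under case resampling the number of baseline trials varies from draw to draw; under residual resampling it is always the five that were actually run.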
Resampling Procedure
Compute the regression equation for acceptance as a function of the five input parameters listed above. For each of the 13 combinations, plug the salt and fat levels into the computed formula to obtain 13 predicted acceptance levels. Subtract the predicted from the actual acceptance levels; these differences (the residuals) represent the discrepancies between predicted and actual results. What if the residual for, say, combination 4 above were to be applied to, say, combination 8? Since the residuals are a kind of uncertainty fuzz around the forecast values, we can explore the effects of rearranging but not enlarging that uncertainty.
- Write down the 13 residuals on pieces of paper. Shuffle the papers into a hat.
- Sample, with replacement, 13 residuals. Some will be positive, some negative. Keep in order of drawing.
- Add these values to the 13 predicted acceptance levels. Using this new set of simulated acceptance values, perform another regression to develop six new coefficients. Record each of these coefficients on separate scoreboards.
- Repeat the sampling and regression steps 1,000 times. Examine each of the scoreboards to see how widely its values spread, for example by finding the 5th and 95th percentiles. These results show the degree of uncertainty for each of the six prediction coefficients.
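The whole procedure above can be sketched in a few lines of Python with NumPy. This is an illustrative translation under stated assumptions, not the Resampling Stats program: the seed is arbitrary, the array `boot` plays the role of the six scoreboards, and the data are copied from the Cheese Table as printed. Because the draws are random, the limits will differ slightly from run to run and from any published figures.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed for reproducibility

salt = np.array([-1, -1, 1, 1, -1.41, 1.41, 0, 0, 0, 0, 0, 0, 0])
fat = np.array([1, 1, 1, -1, 0, 0, -1.41, 1.41, 0, 0, 0, 0, 0])
accept = np.array([4.2, 2.8, 7.4, 6.1, 3.4, 6.6, 4.6, 7.0,
                   5.2, 5.6, 5.4, 6.0, 5.6])
X = np.column_stack([salt, fat, salt * fat, salt**2, fat**2, np.ones(13)])
names = ["B1", "B2", "B12", "B11", "B22", "constant"]

# Fit the model once, then compute predicted values and residuals.
coef, *_ = np.linalg.lstsq(X, accept, rcond=None)
fitted = X @ coef
resid = accept - fitted

n_reps = 1000
boot = np.empty((n_reps, X.shape[1]))  # one row of six coefficients per trial
for i in range(n_reps):
    # Draw 13 residuals with replacement and add them to the predictions.
    resid_star = rng.choice(resid, size=resid.size, replace=True)
    accept_star = fitted + resid_star
    # Refit the regression on the simulated acceptance values.
    boot[i] = np.linalg.lstsq(X, accept_star, rcond=None)[0]

# 5th and 95th percentiles of each coefficient across the 1,000 trials.
lo, hi = np.percentile(boot, [5, 95], axis=0)
for name, c, l, h in zip(names, coef, lo, hi):
    print(f"{name:>8}: {c: .3f}  [{l: .3f}, {h: .3f}]")
```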
Computer Implementation in Resampling Stats
READ file "cheese.dat"; salt fat accept
each of these vectors, "salt," "fat," and "accept," will acquire 13 values
MULTIPLY salt fat saltfat
"Salt" and "fat" are vectors with 13 values each, "saltfat" will acquire 13 values. Resampling Stats treats "saltfat" as a label, not as an instruction to multiply "salt" by "fat."
SQUARE salt saltsq
SQUARE fat fatsq
REGRESS accept salt fat saltfat saltsq fatsq model
Work out a regression model to forecast acceptance as a function of salt, fat, salt X fat, salt-squared, and fat-squared. Put the estimated coefficients into vector "model."
TAKE model 1 B1
the first value in vector "model" is the coefficient for the effect of salt on acceptance
MULTIPLY B1 salt B1salt
"salt" has 13 values in it, each of which will be multiplied by the single coefficient in B1 and put into "B1salt"
TAKE model 2 B2
take the second value in vector "model" and copy it into "B2," the coefficient applicable to fat level
MULTIPLY B2 fat B2fat
TAKE model 3 B12
Continue taking individual coefficients. "B12" is the coefficient for variable "saltfat."
MULTIPLY B12 saltfat B12sf
TAKE model 4 B11
"B11" signifies the coefficient applicable to salt-squared, i.e., salt * salt
MULTIPLY B11 saltsq B11ss
TAKE model 5 B22
"B22" is the coefficient for fat * fat
MULTIPLY B22 fatsq B22ff
TAKE model 6 B0
the sixth value in "model" is the constant; now add together for vector of predicted values
ADD B0 B1salt B2fat B12sf B11ss B22ff accepth
Adding the constant to the five component vectors gives "accepth," a 13-value vector of forecast acceptance levels. Each value is the predicted acceptance level for that row's values of salt and fat.
SUBTRACT accept accepth resid
What is the difference between the forecast values for each of the 13 cases, and the experimental acceptance? The answer is an array of 13 values, the residuals.
COPY (0) null
prepare to use the "sumabsdev" command
SUMABSDEV resid null sumresid
subtract the "resid" array from an array of zeroes, then add the results disregarding signs
PRINT model
we should see six coefficients, for salt, fat, saltfat, salt-sqrd, fat-sqrd, and constant
PRINT sumresid
If the study is done again, we would like to know how well the new data fit. The sum of the absolute residuals gives an indication, since the better our model fits, the smaller the residuals
'PRINT accepth
Remove the <'> to see the forecast acceptance levels, which can be compared with the actual acceptance levels as shown in the table above
REPEAT 1000
begin a simulation to evaluate what would happen if the acceptance levels had been randomly shifted to the extent of the residuals from our prediction equation
SAMPLE 13 resid resid$
draw 13 residuals at random, with replacement, from the discrepancies between experimental and predicted acceptance values, putting them into vector "resid$" where the <$> indicates a simulation equivalent
ADD resid$ accepth accept$
modify the predicted acceptance figures by adding the resampled residuals
REGRESS noprint accept$ salt fat saltfat saltsq fatsq model$
Perform another regression, but this time with the altered "accept$" figures. The "noprint" option prevents 1,000 detailed reports on the progress of the regression.
SCORE model$ B1$ B2$ B12$ B11$ B22$ constnt$
Since "SCORE" saves only one value into each scoreboard vector, we provide a series of destination vectors. Each of these will receive just one new value for each Repeat loop.
END
PERCENTILE B1$ (5 95) B1-range
Vector "B1$" holds 1,000 simulated values for coefficient B1, each based on a different set of simulated acceptance levels. The interval between the 5th and 95th percentiles therefore indicates how reliable the original estimate of B1 is.
PERCENTILE B2$ (5 95) B2-range
PERCENTILE B12$ (5 95) B12range
PERCENTILE B11$ (5 95) B11range
PERCENTILE B22$ (5 95) B22range
PERCENTILE constnt$ (5 95) conrange
PRINT B1-range B2-range B12range B11range B22range conrange
Results
| Coefficient | Model | .05 limit | .95 limit |
| B1 (salt) | 1.371 | 1.076 | 1.679 |
| B2 (fat) | 0.487 | 0.221 | 0.789 |
| B12 (saltfat) | 0.525 | 0.079 | 0.995 |
| B11 (saltsq) | -0.346 | -0.633 | -0.064 |
| B22 (fatsq) | 0.056 | -0.234 | 0.332 |
| Constant | 5.56 | 5.18 | 5.88 |
Conclusion
The coefficient for the fat-squared parameter is very small and cannot be distinguished from zero. Perhaps the food company should repeat the regression analysis, omitting any fat-squared term. It looks as though salt is the main ingredient that leads to higher acceptance, with fat somewhat less important. The saltfat term may be nearly as important as fat itself, but the spread in its coefficient (B12) is very large.
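One way to read the Results table is to ask which intervals include zero and how wide each one is. The short Python sketch below does this using only the figures reported in the table; the dictionary and variable names are hypothetical.

```python
# (model estimate, .05 limit, .95 limit), copied from the Results table.
results = {
    "B1 (salt)": (1.371, 1.076, 1.679),
    "B2 (fat)": (0.487, 0.221, 0.789),
    "B12 (saltfat)": (0.525, 0.079, 0.995),
    "B11 (saltsq)": (-0.346, -0.633, -0.064),
    "B22 (fatsq)": (0.056, -0.234, 0.332),
    "Constant": (5.56, 5.18, 5.88),
}

for name, (est, lo, hi) in results.items():
    contains_zero = lo <= 0.0 <= hi
    print(f"{name:<14} estimate={est:7.3f}  width={hi - lo:.3f}  "
          f"interval contains zero: {contains_zero}")

# Coefficients whose 5%-95% interval includes zero.
contains_zero_names = [n for n, (_, lo, hi) in results.items() if lo <= 0.0 <= hi]
# Coefficient with the widest 5%-95% interval.
widest = max(results, key=lambda n: results[n][2] - results[n][1])
print("contains zero:", contains_zero_names)
print("widest interval:", widest)
```

Only the fat-squared interval includes zero, and the saltfat interval is the widest of the six, which matches the reading given in the conclusion.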