Cholesterol
Problem
The following are cholesterol reduction scores for nine men who were given the drug cholestyramine: -21.0, 3.25, 10.75, 13.75, 32.5, 39.5, 41.75, 56.75, 62.1. (The negative reading indicates that one man had increased cholesterol rather than a reduction.) How confidently can we draw a conclusion about the effect of the drug on cholesterol reduction (i.e., the central tendency) with experimental data like this that may have some extreme values ("outliers")? One approach to eliminating the destabilizing effect that outliers can have on measures of central tendency is to trim off the highest and the lowest values. Trimming the two highest and the two lowest scores from this set of nine (2/9, or about 22%, from each end) produces a "22% trimmed mean." To determine the effect of trimming extreme values from this data set, we must compare the standard deviation of the 22% trimmed mean of the cholesterol scores to that of the untrimmed mean (see Efron & Tibshirani, 1991).
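As a point of reference before any resampling, the two measures of central tendency can be computed directly from the nine scores. The following is a minimal Python sketch; the variable names are illustrative and are not part of the original example.

```python
from statistics import mean

# The nine cholesterol reduction scores from the problem statement.
effect = [-21.0, 3.25, 10.75, 13.75, 32.5, 39.5, 41.75, 56.75, 62.1]

# Untrimmed mean: uses all nine scores.
untrimmed_mean = mean(effect)

# 22% trimmed mean: sort, drop the two lowest and the two highest,
# then average the middle five scores.
middle = sorted(effect)[2:-2]
trimmed_mean = mean(middle)

print(f"untrimmed mean   = {untrimmed_mean:.2f}")
print(f"22% trimmed mean = {trimmed_mean:.2f}")
```

The two point estimates are close (about 26.59 and 27.65), so the question here is not which estimate is larger but which one varies less from sample to sample.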
Resampling Procedure
Find the standard deviation of the mean of the untrimmed scores.
1. Write the 9 scores onto 9 pieces of paper.
2. Mix up the papers, select 1 and copy down its number, then replace the paper. Repeat this until you have a list of 9 numbers.
3. Calculate the mean of the 9 numbers, and record it on a scoreboard.
4. Repeat (2) and (3) 1,000 times.
5. Calculate the standard deviation of the 1,000 recorded means.
Find the standard deviation of the mean of the trimmed scores.
1. Write the 9 scores onto 9 pieces of paper.
2. Mix up the papers, select 1 and copy down its number, then replace the paper. Repeat this until you have a list of 9 numbers.
3. Sort the list from lowest to highest. Eliminate the two lowest and the two highest scores.
4. Calculate the mean of the remaining 5 scores, and record it on a scoreboard.
5. Repeat (2) through (4) 1,000 times.
6. Calculate the standard deviation of the 1,000 recorded means.
Computer Implementation in Resampling Stats
DATA (-21 3.25 10.75 13.75 32.5 39.5 41.75 56.75 62.1) effect
the vector "effect" holds all the experimental data, that is, the 9 scores
REPEAT 1000
SAMPLE 9 effect effect$
"effect$" is a simulated group of results. We are going to take means of this group with and without trimming the two highest and two lowest scores
MEAN effect$ untrim$
SCORE untrim$ untrim$$
record the mean of the untrimmed scores on a scoreboard called "untrim$$"
SORT effect$ sorted
sort the scores in preparation for trimming the highest and lowest
TAKE sorted 3,7 middle$
eliminate the two highest and the two lowest scores
MEAN middle$ trim$
calculate the mean of the trimmed scores, that is, the middle five of the nine (about 56%)
SCORE trim$ trim$$
save the mean of the trimmed scores on a scoreboard called "trim$$"
END
STDEV untrim$$ sdall
this is the standard deviation of the means of the untrimmed scores
STDEV trim$$ sdtrim
this is the standard deviation of the means of the trimmed scores
PRINT sdall sdtrim
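For readers without Resampling Stats, the program above can be approximated in Python. This is a rough translation, not the original authors' code: the fixed seed is an assumption added for repeatability, the list names are illustrative, and `statistics.stdev` (the sample standard deviation) is assumed to stand in for STDEV. With 1,000 recorded means, the choice between sample and population standard deviation makes very little difference.

```python
import random
from statistics import mean, stdev

random.seed(1)  # assumption: fixed seed so that repeated runs agree

# DATA: the nine observed scores.
effect = [-21.0, 3.25, 10.75, 13.75, 32.5, 39.5, 41.75, 56.75, 62.1]

untrim_scores = []  # plays the role of the scoreboard untrim$$
trim_scores = []    # plays the role of the scoreboard trim$$

for _ in range(1000):                                 # REPEAT 1000
    resample = random.choices(effect, k=len(effect))  # SAMPLE 9 effect effect$
    untrim_scores.append(mean(resample))              # MEAN, then SCORE (untrimmed)
    middle = sorted(resample)[2:7]                    # SORT, then TAKE positions 3 through 7
    trim_scores.append(mean(middle))                  # MEAN, then SCORE (trimmed)

sdall = stdev(untrim_scores)   # STDEV untrim$$ sdall
sdtrim = stdev(trim_scores)    # STDEV trim$$ sdtrim
print(f"sdall = {sdall:.2f}")
print(f"sdtrim = {sdtrim:.2f}")
```

As in the original program, each resample is used twice, once for the untrimmed mean and once for the trimmed mean, so the two scoreboards are built from the same 1,000 simulated groups. Exact figures depend on the random draws and will differ somewhat from run to run.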
Results
sdall = 8.58
sdtrim = 10.81
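The untrimmed figure can be cross-checked without simulation. When n values are drawn with replacement from a fixed list, the standard deviation of their mean equals the population standard deviation of the list divided by the square root of n. A short Python sketch of that check follows; the names are illustrative.

```python
from math import sqrt
from statistics import pstdev

effect = [-21.0, 3.25, 10.75, 13.75, 32.5, 39.5, 41.75, 56.75, 62.1]

# Resampling treats the nine scores as the whole population,
# so the population standard deviation (pstdev) is the one to use.
theoretical_sd = pstdev(effect) / sqrt(len(effect))
print(f"theoretical SD of the untrimmed mean = {theoretical_sd:.2f}")
```

This gives about 8.48, close to the simulated 8.58; a gap of that size is consistent with the chance variation expected from 1,000 trials. No equally simple formula is available for the trimmed mean, which is where the resampling approach is most useful.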
Conclusion
Although we might have expected that removing the most extreme values would reduce the standard deviation, the resampling simulation shows that more precise results are obtained by retaining all the data. The standard deviation of the trimmed mean was 10.8, compared with 8.6 for the untrimmed mean. In other words, the means obtained from 9 values were more consistent than the means obtained from only 5 values; any stability gained by trimming was more than offset by the stability lost to the smaller sample. In this case, the original data did not have sufficiently extreme outliers to make a 22% trimmed mean worthwhile. Had we trimmed a smaller portion of the data, or had one or two of the values been more extreme, trimming might have been a useful tool.
Exercise: What happens if you trim only one value from each end (an 11% trim)?
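For the exercise, it is convenient to make the amount of trimming a parameter. The function below is a hypothetical helper written for illustration (its name, arguments, and seed are assumptions, not part of Resampling Stats); it repeats the same resampling procedure for any number of values trimmed from each end.

```python
import random
from statistics import mean, stdev


def bootstrap_sd_of_trimmed_mean(scores, trim_each_end, trials=1000, seed=None):
    """Estimate the SD of a trimmed mean by resampling with replacement.

    trim_each_end is the number of values dropped from each end of the
    sorted resample; 0 gives the ordinary (untrimmed) mean.
    """
    n = len(scores)
    if trim_each_end < 0 or 2 * trim_each_end >= n:
        raise ValueError("trimming must leave at least one value")
    rng = random.Random(seed)
    recorded = []
    for _ in range(trials):
        resample = sorted(rng.choices(scores, k=n))
        kept = resample[trim_each_end:n - trim_each_end]
        recorded.append(mean(kept))
    return stdev(recorded)


effect = [-21.0, 3.25, 10.75, 13.75, 32.5, 39.5, 41.75, 56.75, 62.1]

# 0 = no trimming, 1 = 11% trim, 2 = 22% trim
for k in (0, 1, 2):
    sd = bootstrap_sd_of_trimmed_mean(effect, k, seed=1)
    print(f"trim {k} from each end: SD of the mean = {sd:.2f}")
```

Passing the same seed for every value of `trim_each_end` applies each trimming level to the same simulated groups, which makes the comparison between levels a little steadier.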
References
Efron, B., & Tibshirani, R. J. (1991, July 26). Statistical data analysis in the computer age. Science, 253, 390-395.