THE NEW BIOSTATISTICS OF RESAMPLING

Julian L. Simon and Peter Bruce

INTRODUCTION

A large proportion of articles in important medical journals nowadays employ probabilistic-statistical machinery; for example, 42 percent of original articles in a 1978-79 sample from the New England Journal of Medicine did so (Emerson and Colditz, 1992). These statistical devices are not understood well (if at all) by many clinicians and researchers. Furthermore, such methods are often not used correctly for the context in which they are employed. For example, a study of 50 articles in the New England Journal of Medicine which used the t statistic to compare the means of three or more groups found that more than half the uses were not appropriate (Godfrey, 1992).

Even if the researcher has chosen a sound technique, the reader is hard put to understand what the researcher has done. Consider this recent not-atypical example in the American Journal of Public Health:

The BMDP statistical package was used to determine statistical significance by entering age, gender, and smoking status (current smoker, former smoker, never smoker) as variables in a polychotomous logistic regression model and examining the P values associated with the regression coefficients...[The] model showed that age (P<.001) and gender (P=.02) were significantly related to smoking status. (Hensrud and Sprafka, 1993, p. 415)

The reason that these conventional techniques are daunting to the reader, and often misused by researchers, is that they are inherently deep and complicated mathematically. Even the relatively simple t test for the comparison of two sample means is built upon a difficult body of formulae such as the Normal approximation, which contains unintuitive elements such as pi and e (the base of natural logarithms). And even this relatively simple statistical test can only properly be used after consulting a body of rules about when it is and is not applicable; the process resembles cooking with a cookbook, and among students is widely known as "pluginski". As the example above shows, the reader is asked to take on faith that the test used is appropriate and the program is correct; no details are given that would allow even the sophisticated reader to make an informed judgment.

In recent decades, an entirely different approach to statistical testing has developed. At a theoretical level the resampling method has taken the world of mathematical statistics by storm. Resampling is at least as efficient as the formulaic method in most situations. More important for purposes here, resampling is transparently clear to both researcher and reader, which reduces the likelihood that "Type 4 error" will occur - that is, use of the wrong method; this intellectual advantage has been shown in controlled experiments (Simon, Atkinson, and Shevokas, 1976).

We shall first show the method in action, comparing resampling procedures with the standard formulaic treatment of the same data in a standard biostatistics text. After these specific examples we provide a general procedure for the resampling method, and then end with discussion of the general properties of the method.

INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL

Let's consider one of the simplest numerical examples of probabilistic-statistical reasoning given toward the front of a standard book on medical statistics (Kahn and Sempos, 1989).
Using data from the Framingham study, the authors ask: What is an appropriate "confidence interval" on the observed ratio of "relative risk" (a measure which is defined below, closely related to the odds ratio) of the development of myocardial infarction 16 years after the study began, for men ages 35-44 with serum cholesterol either above 250, or equal to or below 250? The raw data are shown in Table 1.

Table 1 (Kahn and Sempos Table 3-8, p. 61)

The reader of the text is provided with five pages of algebra leading to a formula which is not only cumbersome to use but also mathematically opaque except to a mathematical statistician; furthermore, it applies only if the risk is less than 10 percent and the "data set is large enough", a statistics journal reference being provided in case the data set is not large enough. Then the reader is given an "alternative method that is very much easier to calculate", but which "we cannot explain in terms of elementary statistics" (p. 62).

Rather than addressing the relative-risk problem immediately, let's work into it slowly, using the same data but breaking the problem into parts to which we apply simpler procedures.

Hypothesis Tests With Counted Data

Consider this classic question about the Framingham serum cholesterol data: What is the degree of surety that there is a difference in myocardial infarction rates between the high- and low-cholesterol groups? The statistical logic begins by asking: How likely is it that the two observed groups "really" came from the same "population" with respect to infarction rates? Operationally, we address this issue by asking how likely it is that two groups as different in disease rates as the observed groups would be produced by the same "statistical universe".

Key step: we assume that the relevant "benchmark" or "null-hypothesis" population (universe) is the composite of the two observed groups. That is, if there really were no "true" difference in infarction rates between the two serum-cholesterol groups, and the observed disease differences occurred just because of sampling variation, the most reasonable representation of the population from which they came is the composite of the two observed groups.

Therefore, we compose a hypothetical "benchmark" universe containing (135 + 470 =) 605 men at risk, and designate (10 + 21 =) 31 of them as infarction cases. We want to determine how likely it is that a universe like this one would produce - just by chance - two groups that differ as much as do the actually observed groups. That is, how often would random sampling from this universe produce one sub-sample of 135 containing a large enough number of infarctions, and the other sub-sample of 470 producing few enough infarctions, that the difference in occurrence rates would be as high as the observed difference of .029? (10/135 = .074, and 21/470 = .045.)

So far, everything that has been said applies both to the conventional formulaic method and to the "new statistics" resampling method. But the logic is seldom explained to the reader of a piece of research - if indeed the researcher her/himself grasps what the formula is doing. And if one just grabs for a formula with a prayer that it is the right one, one need never analyze the statistical logic of the problem at hand.

Now we tackle this problem with a method that you would think of yourself if you began with the following mind-set: How can I simulate the mechanism whose operation I wish to understand? These steps will do the job:
1. Fill an urn with 605 balls, 31 red and the rest (605 - 31 = 574) green.

2. Draw one sample of 135 (simulating the high serum-cholesterol group), one ball at a time, throwing each ball back after it is drawn to keep the simulated probability of an infarction the same throughout the sample; record the number of reds. Then do the same with another sample of 470 (the low serum-cholesterol group).

3. Calculate the difference in infarction rates for the two simulated groups, and compare it to the actual difference of .029; if the simulated difference is that large, record "Yes" for this trial; if not, record "No".

4. Repeat steps 2 and 3 until a total of (say) 400 or 1000 trials have been completed. Compute the frequency with which the simulated groups produce a difference as great as actually observed. This frequency is an estimate of the probability that a difference as great as that actually observed in Framingham would occur even if serum cholesterol has no effect upon myocardial infarction.

The procedure above can be carried out with balls in a ceramic urn in a few hours. Yet it is natural to seek the added convenience of the computer to draw the samples. Therefore, we illustrate in Figure 1 how a simple computer program handles this problem. We use our own RESAMPLING STATS language, but the program can be written in other languages as well, though usually with more complexity and less clarity.

Figure 1

The results of the test using this program may be seen in the histogram in Figure 1. We find - perhaps surprisingly - that a difference as large as that observed would occur by chance fully 10 percent of the time. (If we were not guided by the theoretical expectation that high serum cholesterol produces heart disease, we might also include the 10 percent of differences going in the other direction, giving a 20 percent chance.) Even a ten percent chance is sufficient to call strongly into question the conclusion that high serum cholesterol is dangerous. At a minimum, this statistical result should call for more research before taking any strong action clinically or otherwise.

Where should one look to determine which procedures should be used to deal with a problem such as the one set forth above? Unlike the formulaic approach, the basic source is not a manual which sets forth a menu of formulas together with sets of rules about when they are appropriate. Rather, you consult your own understanding of what is happening in (say) the Framingham situation, and of the question that needs to be answered, and then you construct a "model" that is as faithful to the facts as is possible. The urn sampling described above is such a model for the case at hand.

To connect what we have done with the conventional approach, we apply a z test (conceptually similar to the t test, but applicable to yes-no data; it is the Normal-distribution approximation to the large binomial distribution) and we find that the results are much the same as the resampling result - an eleven percent probability.

Someone may ask: Why do a resampling test when you can use a standard device like a z or t test? The great advantage of resampling is that it avoids "Type 4 error" - using the wrong method. The researcher is more likely to arrive at sound conclusions with resampling because s/he can understand what s/he is doing, instead of blindly grabbing a formula which may be in error.
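For readers more familiar with a general-purpose language than with RESAMPLING STATS, here is a minimal sketch of the same urn simulation in Python; it is a rough equivalent of the Figure 1 program, not the authors' code, and the variable names and the choice of 1000 trials are ours.

    import random

    observed_diff = 10/135 - 21/470      # .074 - .045, about .029
    urn = [1]*31 + [0]*574               # 1 = infarction, 0 = no infarction (the composite universe)

    trials = 1000
    count_as_large = 0
    for _ in range(trials):
        high = [random.choice(urn) for _ in range(135)]   # simulated high-cholesterol group (with replacement)
        low = [random.choice(urn) for _ in range(470)]    # simulated low-cholesterol group
        diff = sum(high)/135 - sum(low)/470               # difference in simulated infarction rates
        if diff >= observed_diff:
            count_as_large += 1

    print(count_as_large / trials)       # the run reported in Figure 1 gave about 0.10

Sampling with replacement keeps the chance of an infarction at 31/605 on every draw, which is exactly what step 2 above requires.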
The textbook from which this example is drawn is an excellent one; the difficulty of the presentation is an inescapable consequence of the formulaic approach to probability and statistics. The body of complex algebra and tables that only a rare expert understands down to the foundations constitutes an impenetrable wall to understanding. Yet without such understanding, there can be only rote practice, which leads to frustration and error.

Confidence Intervals for the Counted Data

Consider for now just the data for the sub-group of 135 high-cholesterol men. A second classic statistical question is as follows: How much confidence should we have that, if we were to take a much larger sample than was actually obtained, the mean (actually the proportion 10/135 = .07) would be in some vicinity of the observed sample mean? Let us first carry out a resampling procedure to answer the question, waiting until afterwards to discuss the logic of the inference.

1. Construct an urn containing 135 balls - 10 black (infarction) and 125 red (no infarction) - to simulate the universe as we guess it to be.

2. Mix, choose a ball, record its color, replace it, and repeat 135 times (to simulate a sample of 135 men).

3. Record the number of black balls among the 135 drawings.

4. Repeat steps 2 and 3 perhaps 1000 times, and observe how much the number of blacks varies from sample to sample. We arbitrarily denote the boundary lines that include 45 percent of the hypothetical samples on each side of the sample mean as the 90 percent "confidence interval" around the mean of the actual population.

Figure 2 shows how this can be done easily on the computer, together with the results.

Figure 2

The variation in the histogram in Figure 2 highlights the fact that a sample containing only 10 cases of infarction is very small, and the number of observed cases - or the proportion of cases - necessarily varies greatly from sample to sample. Perhaps the most important implication of this statistical analysis, then, is that we badly need to collect additional data.

This is a classic problem in confidence intervals, found in all subject fields. For example, at the beginning of the first chapter of a best-selling book in business statistics, Wonnacott and Wonnacott use the example of a 1988 presidential poll. The language used in the cholesterol-infarction example above is exactly the same as the language used for the Bush-Dukakis poll except for labels and numbers. Also typically, the text gives a formula without explaining it, and says that it is "fully derived" eight chapters later (Wonnacott and Wonnacott, 1990, p. 5). With resampling, one never needs such a formula, and never needs to defer the explanation.

The philosophic logic of confidence intervals is quite deep and controversial, less obvious than that of the hypothesis test. The key idea is that we can estimate for any given universe the probability P that a sample's mean will fall within any given distance D of the universe's mean; we then turn this around and assume that if we know the sample mean, the probability is P that the universe mean is within distance D of it. This inversion is more slippery than it may seem. But the logic is exactly the same for the formulaic method and for resampling. The only difference is how one estimates the probabilities - either with a numerical resampling simulation, or with a formula or other deductive mathematical device (such as counting and partitioning all the possibilities, as Galileo did when he answered a gambler's question about three dice).
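To make the resampling side of that comparison concrete, here is a minimal Python sketch of the proportion procedure described in steps 1-4 above. It is a rough equivalent rather than the authors' Figure 2 program; like Figure 2, it reports the interval that encloses 95 percent of the trial results, and the names and the 1000-trial count are ours.

    import random

    urn = [1]*10 + [0]*125               # the observed high-cholesterol sample: 10 infarctions, 125 without
    proportions = []
    for _ in range(1000):
        resample = [random.choice(urn) for _ in range(135)]   # draw 135 with replacement
        proportions.append(sum(resample) / 135)               # proportion with infarction in this resample

    proportions.sort()
    print(proportions[24], proportions[974])   # roughly the 2.5th and 97.5th percentiles;
                                               # the run reported in Figure 2 gave about .037 to .119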
When one uses the resampling method, the probabilistic calculations are the least demanding part of the work. One then has mental capacity available to focus on the crucial part of the job - framing the original question soundly, choosing a way to model the facts so as to properly resample the actual situation, and drawing appropriate inferences from the simulation.

If you have understood the general logic of the procedures used up to this point, you are in command of all the conceptual knowledge necessary to construct your own tests to answer any statistical question. A lot more practice, working on a variety of problems, obviously would help. But the key elements are simple: 1) model the real situation accurately, 2) experiment with the model, and 3) compare the results of the model with the observed results.

Confidence Intervals on Relative Risk With Resampling

Now we are ready to calculate - with full understanding - the confidence intervals on relative risk that the text sought. Recall that the observed sample of 135 high-cholesterol men had 10 infarctions (a proportion of .074), and the sample of 470 low-cholesterol men had 21 infarctions (a proportion of .045). We estimate the relative risk of high cholesterol as .074/.045. Let us frame the question this way: If we were to randomly draw a sample from the universe of high-cholesterol men that is best estimated from our data (an infarction rate of .074), and a sample from the universe of low-cholesterol men (an infarction rate of .045), and do this again and again, within which bounds would the relative risk calculated from that simulation fall (say) 95 percent of the time?

The operation is much the same as that for the single confidence interval estimated above, except that we do the operation for both sub-samples at once, and then calculate the ratio between their results. As before, we would like to know what would happen if we could take additional samples from the universes that spawned our actual samples. Lacking the resources to do so, we let those original samples "stand in" for the universes from which they came, serving as proxy "substitute universes." We can imagine replicating each sample element millions of times to "bootstrap" these "proxy universes." Paralleling the real world, we take simulated samples of the same size as our original samples. (Actually, we can skip replicating each sample element a million times and achieve the same resampling effect by sampling with replacement from our original samples - that way, the chance that a sample element will be drawn remains the same from draw to draw.)

We count the number of infarctions in each of our resamples, and for the pair of resamples we calculate the relative risk measure and keep score of this result. We then take additional pairs of resamples, each time calculating the relative risk measure. We may compare our results in Figure 3 - a confidence interval extending from 0.69 to 3.4 - to the results given in Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to 3.7 from three different formulas (pp. 62-63); the agreement is close.

Figure 3

It is interesting that this may be the first time a calculation of relative risk using resampling has ever been published. It therefore should be a contribution to the statistics literature comparable with the formulaic approaches published in earlier years.
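For concreteness, here is a minimal Python sketch of this paired-resample procedure. It parallels the Figure 3 program rather than reproducing it, and the names and the 1000-trial count are ours.

    import random

    high_urn = [1]*10 + [0]*125          # proxy universe for the high-cholesterol group
    low_urn = [1]*21 + [0]*449           # proxy universe for the low-cholesterol group

    ratios = []
    for _ in range(1000):
        high = [random.choice(high_urn) for _ in range(135)]   # resample of 135, with replacement
        low = [random.choice(low_urn) for _ in range(470)]     # resample of 470, with replacement
        ratios.append((sum(high) / 135) / (sum(low) / 470))    # relative risk for this pair of resamples

    ratios.sort()
    print(ratios[24], ratios[974])       # roughly the 2.5th and 97.5th percentiles;
                                         # the run reported in Figure 3 gave about 0.69 to 3.4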
But because the procedure is worked out here on an ad hoc basis, and does not seem to be very difficult, it probably is not worth publishing separately. We point this out because resampling routinely produces entirely new procedures at least as powerful as the previously existing formulaic procedures. These resampling procedures also have the advantage of being fully understood even by persons who are not professional statisticians but who think hard about their subject matter, and then create appropriate procedures by working from first principles and modeling their actual research situations with care and understanding. Even underclasspersons in a state university are able to do this; one would expect persons in medical school or beyond it to be at least equally capable. That is the true revolution wrought by resampling.

SOME OTHER ILLUSTRATIONS

A Measured-Data Example: Test of a Drug to Prevent Low Birthweight

The Framingham infarction-cholesterol examples worked with yes-no "count" data. Let us therefore consider some illustrations of the use of resampling with measured data. Another leading textbook (Rosner, 1982, p. 257) gives the example of a test of the hypothesis that drug A prevents low birthweights. The data for the treatment and control groups are shown in Table 2. Here is a resampling approach to the problem:

Table 2

1. If the drug has no effect, our best guess about the "universe" of birthweights is that it is composed of (say) a million copies of each of the observed weights, lumped together. In other words, in the absence of any other information or compelling theory, we assume that the combination of our samples is our best estimate of the universe. Hence write each of the birthweights on a card, and put them into a hat. Drawing them one by one and then replacing them is the operational equivalent of a very large (but equal) number of each birthweight.

2. Repeatedly draw two samples of 15 each, and check how frequently the simulated difference is as large as or larger than the actual difference. We find in Figure 4 that only 1 percent of the pairs of hypothetical resamples produced means that differed by as much as .82. We therefore conclude that the observed difference is unlikely to have occurred by chance.

Figure 4

Matched-Patients Test of Three Treatments

There have been several recent three-way tests of treatments for depression: drug versus cognitive therapy versus combined drug and cognitive therapy. Consider this procedure for a proposed test in which 31 triplets of people have been matched within triplet by sex, age, and years of education. The three treatments are to be assigned randomly within each triplet. Assume that the outcomes on a series of tests were ranked from best (#1) to worst (#3) within each triplet, and assume that the combined drug-and-therapy regime has the best (lowest) average rank. How sure can we be that the observed result would not occur by chance? In hypothetical Table 3 the average rank for the drug-and-therapy regime is 1.74. Is it possible that the regimes do not differ with respect to effectiveness, and that the drug-and-therapy regime came out with the best rank just by the luck of the draw? We test by asking: "If there is no difference, what is the probability of getting an average rank this good, just by chance?"

Table 3

Figure 5 shows a program for a resampling procedure that repeatedly produces 31 sets of ranks randomly selected from the numbers 1, 2 and 3, and averages the ranks for each treatment.
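As a rough sketch of what such a program does, here is a Python equivalent. Like the Figure 5 program, it draws each triplet's rank for the drug-and-therapy group independently from the numbers 1, 2 and 3; the names and the 1000-trial count are ours.

    import random

    trials = 1000
    as_good = 0
    for _ in range(trials):
        ranks = [random.choice([1, 2, 3]) for _ in range(31)]  # one random rank per triplet for the drug/therapy group
        if sum(ranks) / 31 <= 1.74:                            # mean rank at least as good as the observed 1.74
            as_good += 1

    print(as_good / trials)              # about .05 in the run reported in the text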
We can then observe whether an average of 1.74 is unusually low, and hence should not be ascribed to chance.

Figure 5

In 1000 repetitions of the simulation, 5 percent yielded average ranks as low as the observed value. This is evidence that something besides chance might be at work here. (The result is at the borderline of the traditional 5 percent "level of significance" - a p-value of .05 - supposedly set arbitrarily by the great statistician R. A. Fisher on the grounds that a 1-in-20 happening is too coincidental to ignore.) That is, the resampling test suggests that it would be very unlikely for one of the treatment regimes to achieve, just by chance, results as much better than the other two regimes as are actually observed.

An interesting feature of this problem is that it would be hard to find a conventional test that would handle this three-way comparison in an efficient manner. Certainly it would be impossible to find a test that would not require formulae and tables that only a talented professional statistician could manage satisfactorily, and even the professional is not likely to fully understand those formulaic procedures.

A DEFINITION AND GENERAL PROCEDURE FOR RESAMPLING

A statistical procedure manipulates some replica of the physical process in which you are interested. A resampling method simulates (models) the process with easy-to-handle symbols. The resampler postulates a universe composed of the observed data, which are then used to produce new hypothetical samples whose properties are then examined. That is, one examines how the universe behaves, comparing the outcomes to a criterion that we choose.

Here is an "operational definition" of resampling: Using the entire set of data you have in hand, produce new samples of simulated data, and examine the results of those samples. That's it in a nutshell.

VARIETIES OF RESAMPLING METHODS

A resampling test may be constructed for almost any statistical inference. Every real-life situation can be modeled by symbols of some sort, and one may experiment with this model to obtain resampling trials. The most important counterindication is insufficient data to perform a useful resampling test, in which case a conventional test - which makes up for the absence of observations with an assumed theoretical distribution - may produce more accurate results if the universe from which the data are selected resembles the chosen theoretical distribution. Exploration of the properties of resampling tests is an active field of research at present.

For the main tasks in statistical inference - hypothesis testing and confidence intervals - the appropriate resampling test often is immediately obvious, as seen in the case of cholesterol and infarction rates above.

(Technical note to biostatisticians: Two sorts of procedures are especially well suited to resampling. 1) When the size of the universe is properly assumed fixed, or when for other reasons sampling without replacement is called for, it is appropriate to sample from among the possible permutations of the data; this is an adaptation of Ronald Fisher's "exact" test (confusingly, also called a "randomization" test). The three-way drug test above is an illustration; the rank of one member of a triplet affects the possible ranks of the other two members, and hence the sampling is done "without replacement".
2) The bootstrap procedure is appropriate when the size of the universe is properly assumed not to be fixed, and the measurement of one entity in the sample does not affect the measurement of another entity. This device - for which there is no analog in conventional formulaic statistics - is illustrated by the birthweight test above.)

Resampling is a much simpler intellectual task than the formulaic method, because simulation obviates the need to calculate the number of possible ways that the event in which you are interested - an infarction, say, or a birth of a certain size - can or cannot occur. In technical terms, resampling does not require computation of the "sample space" or any part of it. In all but the most elementary problems, where simple permutations and combinations suffice, such calculations require advanced training and delicate judgment; these calculations are the root of the mathematical and conceptual difficulty of conventional formulaic statistics.

Resampling avoids the complex abstraction of sample-space calculations by substituting the particular information about how elements in the sample are generated randomly in a specific event, as learned from the actual circumstances; the analytic method does not use this information. In the case of the gamblers prior to Galileo, resampling used the (assumed) fact that three fair dice are thrown with an equal chance of each outcome, and the gamblers took advantage of experience with many such events performed one at a time; in contrast, Galileo made no use of the actual stochastic element of the situation, and gained no information from a sample of such trials, but rather replaced all possible sequences by exhaustive computation.

The resampling method is not theoretically inferior to the formulaic method. Resampling is not "just" a stochastic-simulation approximation to formulas. It is a quite different route to the same endpoint, using different intellectual processes and utilizing different sorts of inputs; both resampling and formulaic calculation are shortcuts to estimation of the sample space and its partitions. Its much lesser intellectual difficulty is the source of resampling's central advantage: it improves the probability that the user will arrive at a sound solution to a problem - the ultimate criterion for all except pure mathematicians. The applicability of resampling is especially great in biostatistics because of the small and irregular samples so common in clinical research.

THE PLACE OF RESAMPLING IN THE REALM OF KNOWLEDGE

Probability theory and its offspring, inferential statistics, constitute perhaps the most frustrating branch of human knowledge. Right from its beginnings in the seventeenth century, the great mathematical discoverers knew that the probabilistic way of thinking -- which we'll call "prob-stats" for short -- offers enormous power to improve our decisions and the quality of our lives. Yet until very recently, when the resampling method came along, scholars were unable to convert this powerful body of theory into a tool that laypersons could and would use freely in daily work and personal life. Instead, only professional statisticians feel themselves in comfortable command of the prob-stats way of thinking. The most frequent applications are by medical and social scientists, who know that prob-stats is indispensable to their work yet too often fear and misuse it.

Resampling is now fully accepted theoretically.
The publication of advanced papers exploring its properties is proceeding at a breathtaking rate throughout the world. And controlled studies show that people ranging from engineers and scientists down to seventh graders quickly handle more problems correctly with resampling than they do with conventional methods. Furthermore, in contrast to the older conventional statistics, which is a painful and humiliating experience for most students at all levels, the published studies show that students enjoy resampling statistics. But resampling has not yet penetrated very far into the classroom, for a variety of institutional and historical reasons.

Resampling in Medical Education

Prob-stats is the bane of medical students as well as all other students required to study it; the statistics course is a painful rite of passage -- like fraternity paddling -- on the way to a degree. Afterwards, the subject is happily put out of mind forever.

Yet the practice of medicine becomes more and more dependent upon a knowledge of statistics. Physicians like to say that they practice on the basis of "clinical knowledge". Yet in an ever-growing proportion of situations, choice of treatment comes straight from research studies whose conclusions depend on statistical tests. Without a sound understanding of inference, a physician cannot evaluate such studies and sort out which to rely upon.

Teaching physicians statistics has been an impossible nut to crack. As one statistician wrote about her attempt to teach medical students conventional statistical methods: "I gazed into the sea of glazed eyes and forlorn faces, shocked by the looks of naked fear my appearance at the lectern prompted" (Vaisrub, 1990).

Students of probability and statistics simply memorize the rules. Most users of prob-stats select their methods blindly, understanding little or nothing of the basis for choosing one method rather than another, and simply push the buttons for one or another easily available computer operation. This often leads to wildly inappropriate practices, and contributes to the damnation of statistics.

The statistical community has made valiant attempts to ameliorate the situation. Great statisticians have struggled to find interesting and understandable ways to teach prob-stats. Learned committees and professional associations have wrung their hands in despair, and spent millions of dollars creating television series and textbooks. Despite successes, these campaigns to promote prob-stats have largely failed. The enterprise smashes up against an impenetrable wall - the body of complex algebra and tables that only a rare expert understands right down to the foundations. For example, almost no one can write the formula for the "Normal" distribution that is at the heart of most statistical tests. Even fewer understand its meaning. Yet without such understanding, there can be only rote learning.

The resampling method, in combination with the personal computer, promises to cure this disease, and finally realize the great potential of statistics and probability. In the absence of formulae, black-box computer programs, and cryptic tables, the resampling approach forces you to address the problem at hand directly. Then, instead of asking "Which formula should I use?" one begins to ask more profound questions such as "Why is something 'significant' if it occurs 4% of the time by chance, yet not 'significant' if a random process produces it 8% of the time?"
About "Exactness" Earlier we suggested that the likelihood of arriving at a sound answer with a valid method, rather than using an incorrect method, is more important scientifically than any likely inexactness from the resampling simulation method. But even that concedes too much: The formulaic method itself is in no way perfectly exact; rather, it rests on approximations. The Normal distribution itself is only an approximation to the binomial. And often there are approximations in computing formulas. [There also is a certain irony in the common objection that resampling is not "exact" because the results are "only" a sam- ple. The basis of all statistical work is sample data drawn from actual populations. Statisticians have only recently managed to win battles against those bureaucrats and social scientists who, out of ignorance of statistics, believed that only a complete census of a country's population, or examination of every volume in a library, could give satisfactory information about unemploy- ment rates or book sizes. Indeed, samples are sometimes even more accurate than censuses. Yet many of those same statisti- cians have been skittish about simulated samples of data points taken from the sample space - drawn far more randomly than the data themselves, even at best. They tend to want a complete "census" of the sample space, even when sampling is more likely to arrive at a correct answer because it is intellectually sim- pler (as with the gamblers and Galileo.)] CONCLUSION Probabilistic analysis is crucial in medicine, perhaps more so than in any other discipline. Judgments about whether to use one treatment or another, or to allow a new medicine on the market, require that the decision-maker assess chance variability in the data. But until now, the practice and teaching of proba- bilistic statistics, with its abstruse structure of mathematical formulas cum tables of values based on restrictive assumptions concerning data distributions -- all of which separate the user from the actual data or physical process under consideration -- have kept the full fruits of statistical understanding from the medical community. Estimating probabilities with conventional mathematical methods is often so complex that the process scares many people. And properly so, because the difficulties lead to frequent errors. The statistical profession has long expressed grave concern about the widespread use of conventional tests whose foundations are poorly understood. The recent ready availability of statistical computer packages that can easily perform conventional tests with a single command, irrespective of whether the user understands what is going on or whether the test is appropriate, has exacerbated this problem. This has led teachers to emphasize descriptive statistics and even ignore inferential statistics. Beneath every formal statistical procedure there lies a physical process. Resampling methods allow one to work directly with the underlying physical model by simulating it. The term "resampling" refers to the use of the given data, or a data generating mechanism such as a die, to produce new samples, the results of which can then be examined. Resampling estimates probabilities by numerical experiments instead of with formulae -- by flipping coins or picking numbers from a hat, or with the same operations simulated on a computer. 
The resampling method enables people to obtain the benefits of statistics and probability theory without the shortcomings of conventional methods, because it is free of mathematical formulas and restrictive assumptions and is easy to understand and use, especially in conjunction with the computer language and program RESAMPLING STATS. It is the overall approach - the propensity to turn first to resampling methods to handle practical problems - that most clearly distinguishes resampling from conventional statistics. In addition, some resampling methods are new in themselves, the result of the basic resample-it tendency of the past quarter century.

Resampling replaces the complex mathematical calculations about the size of the sample space and its parts with a simulation of the conditions that produce the individual events; the information about these concrete conditions is not used by the formulaic method. This very different intellectual method is the source of resampling's clarity and simplicity.

REFERENCES

Edgington, Eugene S., Randomization Tests (New York: Marcel Dekker, 1980).

Efron, Bradley, and Persi Diaconis, "Computer Intensive Methods in Statistics," Scientific American, May 1983, pp. 116-130.

Emerson, John D., and Graham A. Colditz, "Use of Statistical Analysis in the New England Journal of Medicine," in John C. Bailar III and Frederick Mosteller (eds.), Medical Uses of Statistics (Boston: NEJM Books, 1992), pp. 45-57.

Godfrey, Katherine, "Comparing the Means of Several Groups," in John C. Bailar III and Frederick Mosteller (eds.), Medical Uses of Statistics (Boston: NEJM Books, 1992), pp. 233-258.

Hensrud, Donald D., and J. Michael Sprafka, "The Smoking Habits of Minnesota Physicians," American Journal of Public Health, vol. 83, March 1993, pp. 415-417.

Kahn, Harold A., and Christopher T. Sempos, Statistical Methods in Epidemiology (New York: Oxford, 1989).

Noreen, Eric W., Computer Intensive Methods for Testing Hypotheses (New York: Wiley, 1989).

Rosner, Bernard, Fundamentals of Biostatistics (Boston: Duxbury, 1982).

Simon, Julian L., Basic Research Methods in Social Science (New York: Random House, 1969; 3rd edition, 1985, with Paul Burstein).

Simon, Julian L., David T. Atkinson, and Carolyn Shevokas, "Probability and Statistics: Experimental Results of a Radically Different Teaching Method," American Mathematical Monthly, vol. 83, no. 9, November 1976.

Simon, Julian L., and Peter C. Bruce, "Resampling: Everyday Statistical Tool," Chance, vol. 4, no. 1, 1991.

Simon, Julian L., Resampling: Probability and Statistics a Radically Different Way (Belmont, CA: Wadsworth, forthcoming 1993).

Vaisrub, Naomie, Chance, Winter 1990, p. 53.

Wonnacott, Thomas H., and Ronald J. Wonnacott, Introductory Statistics for Business and Economics, 4th edition (New York: Wiley, 1990).

Figure 1. RESAMPLING STATS program for the test of the difference in infarction rates.

URN 31#1 574#2 men        An urn called "men" with 31 ones (=infarctions) and 574 twos (=no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men high       Sample (with replacement!) 135 of the numbers in this urn, give this group the name "high"
SAMPLE 470 men low        Same for a group of 470, call it "low"
COUNT high =1 a           Count infarctions in first group
DIVIDE a 135 aa           Express as a proportion
COUNT low =1 b            Count infarctions in second group
DIVIDE b 470 bb           Express as a proportion
SUBTRACT aa bb c          Find the difference in infarction rates
SCORE c z                 Keep score of this difference
END                       End the trial, go back and repeat
HISTOGRAM z
COUNT z >=.029 k          How often was the resampled difference >= the observed difference?
DIVIDE k 1000 kk          Convert this result to a proportion
PRINT kk

[Histogram of the 1000 resampled differences between the two groups (proportion with infarction), ranging from about -0.1 to 0.1.]

Result: kk = 0.102 (the proportion of resample pairs with a difference >= .029)

Figure 2. RESAMPLING STATS program for the confidence interval on the proportion of infarctions among the 135 high-cholesterol men.

URN 10#1 125#0 men        An urn (called "men") with ten 1's (infarctions) and 125 0's (no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men a          Sample (with replacement) 135 numbers from the urn, put them in "a"
COUNT a =1 b              Count the infarctions
DIVIDE b 135 c            Express as a proportion
SCORE c z                 Keep score of the result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of all trial results
PERCENTILE z (2.5 97.5) k Determine the 2.5th and 97.5th percentiles of all trial results; these points enclose 95% of the results
PRINT k

[Histogram of the 1000 resampled proportions with infarction, ranging from about 0 to 0.2.]

Result: k = 0.037037 0.11852 (the 95% confidence interval, enclosing 95% of the resample results)

Figure 3. RESAMPLING STATS program for the confidence interval on relative risk.

URN 10#1 125#0 high       The universe of 135 high-cholesterol men, 10 of whom (1's) have infarctions
URN 21#1 449#0 low        The universe of 470 low-cholesterol men, 21 of whom (1's) have infarctions
REPEAT 1000               Repeat the steps that follow 1000 times
SAMPLE 135 high high$     Sample 135 (with replacement) from the high-cholesterol universe, and put them in "high$" [the "$" suffix just indicates a resampled counterpart to the actual sample]
SAMPLE 470 low low$       Similarly for 470 from the low-cholesterol universe
COUNT high$ =1 a          Count the infarctions in the first resampled group
DIVIDE a 135 aa           Convert to a proportion
COUNT low$ =1 b           Count the infarctions in the second resampled group
DIVIDE b 470 bb           Convert to a proportion
DIVIDE aa bb c            Divide the proportions to calculate relative risk
SCORE c z                 Keep score of this result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of trial results
PERCENTILE z (2.5 97.5) k Find the percentiles that bound 95% of the trial results
PRINT k

[Histogram of the 1000 resampled relative risks, ranging from about 0 to 6.]

Results (estimated 95% confidence interval): k = 0.68507 3.3944

Figure 4. RESAMPLING STATS program for the test of the difference in mean birthweights.

NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7 8.6) treat
NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2 6.8) control
CONCAT treat control all  Combine all observations in same vector
REPEAT 1000               Do 1000 simulations
SAMPLE 15 all treat$      Take a resample of 15 from all birthweights (the $ indicates a resampling counterpart to a real sample)
SAMPLE 15 all control$    Take a second, similar resample
MEAN treat$ mt            Find the means of the two resamples
MEAN control$ mc
SUBTRACT mt mc dif        Find the difference between the means of the two resamples
SCORE dif z               Keep score of the result
END                       End the simulation experiment, go back and repeat
HISTOGRAM z               Produce a histogram of the resample differences
COUNT z >= .82 k          How often did resample differences exceed the observed difference of .82?

[Histogram of the 1000 resample differences in mean birthweight (pounds), ranging from about -1.5 to 1.5.]

Result: Only 1.3% of the pairs of resamples produced means that differed by as much as .82. We can conclude that the observed difference is unlikely to have occurred by chance.

Figure 5. RESAMPLING STATS program for the test of the mean rank of the drug-and-therapy treatment.

REPEAT 1000               Do 1000 simulations
GENERATE 31 (1 2 3) ranks Generate 31 numbers, each a 1, 2 or 3, to simulate random assignment of ranks 1-3 to the drug/therapy alternative
MEAN ranks rankmean       Take the mean of these 31 ranks
SCORE rankmean z          Keep score of the mean
END                       End the simulation, go back and repeat
HISTOGRAM z               Produce a histogram of the rank means
COUNT z <=1.74 k          How often was the mean rank as good as (<=) 1.74, the observed value?
PRINT k

[Histogram of the 1000 simulated mean ranks, ranging from about 1.4 to 2.6.]

Table 1. Development of Myocardial Infarction in Framingham after 16 Years, Men Age 35-44, by Level of Serum Cholesterol

Serum cholesterol (mg%)   Developed MI   Did not develop MI   Total
>250                      10             125                  135
<=250                     21             449                  470

Source: Shurtleff, D., The Framingham Study: An Epidemiologic Investigation of Cardiovascular Disease, Section 26. Washington, DC: U.S. Government Printing Office. Cited in Kahn and Sempos (1989), p. 61, Table 3-8.

Table 2. Birthweights in a Clinical Trial to Test a Drug for Preventing Low Birthweights

          Baby weight (lb)
Patient   Treatment group   Control group
1         6.9               6.4
2         7.6               6.7
3         7.3               5.4
4         7.6               8.2
5         6.8               5.3
6         7.2               6.6
7         8.0               5.8
8         5.5               5.7
9         5.8               6.2
10        7.3               7.1
11        8.2               7.0
12        6.9               6.9
13        6.8               5.6
14        5.7               4.2
15        8.6               6.8

Source: Rosner, Table 8.7

Table 3. Observed Rank of Treatments, by Effectiveness (Hypothetical)

            Treatment group
Triplet     Drug   Therapy   Drug/Therapy
1           3      1         2
2           2      3         1
3           1      3         2
.           .      .         .
.           .      .         .
.           .      .         .
31          2      1         3
Avg. rank   2.29   1.98      1.74