CHAPTER III-1

THE RESAMPLING METHOD FOR STATISTICAL INFERENCE

INTRODUCTION

This chapter introduces the resampling method in historical and theoretical perspective, and illustrates the method.

About 1615, Italian gamblers brought to Galileo Galilei a problem in the game of three dice. The theorists of the day had figured the chances of getting totals of 9 and 10 (also 11 and 12) as equal, because there are the same number of ways (six) of making those points -- for example, a nine can be 126, 135, 144, 234, 225, or 333. But players had found that in practice 10 is made more often than 9. How come?

Galileo then invented the device of the "sample space" of possible outcomes. He colored three dice white, gray, and black, and systematically listed every possible permutation. The theorists who had gone before -- and Gottfried Leibniz later made the same mistake -- had instead lumped together into a single category the various possible ways of getting (say) a 3, 3, and 4 to make 10. That is, they listed combinations rather than permutations, and various combinations contain different numbers of permutations. Galileo's analysis confirmed the gamblers' empirical results. Ten does come up more frequently than 9, because there are 27 permutations that add to 10 whereas there are only 25 permutations that add to 9.

The gamblers' use of repeated trials to learn what they wanted to know illustrates the power of the resampling method -- which we can simply call "simulation" or "experimentation" here. And with sufficient repetitions, one can arrive at as accurate an answer as desired. Not only was the resampling method adequate; in the case of three dice it was a better method than deductive logic, because it gave the correct answer when the deduction of the day did not. Though the only logic needed was enumeration of the possibilities, that enumeration was too difficult for the learned doctors of the day. The powers of a Galileo were necessary to produce the correct logic.

Even after Galileo's achievement, the powers of Blaise Pascal and Pierre Fermat were needed to analyze correctly, with the multiplication rule, another such problem -- the chance of at least one ace in four throws of a die. (This problem, presented by the Chevalier de Méré, is considered the origin of probability theory.) For lesser mathematical minds, the analysis was too difficult. Yet ordinary players were able to discern the correct relative probabilities, even though the differences in probabilities are slight in both the Galileo and Pascal-Fermat problems. Simulation's effectiveness is its best argument.

One might rejoin that the situation is different now that Galileo, Pascal, Fermat, and their descendants have invented analytic methods to handle such problems correctly. Why not use the existing analytic methods instead of resampling? The existence of a correct algorithm does not imply that it will be used appropriately, however. And a wrongly-chosen algorithm is far worse than no algorithm at all -- as the Chevalier's pocketbook attested. In our own day, decades of experience have proven that "pluginski" -- the memorization of formulas that one cannot possibly understand intuitively -- may enable one to survive examinations, but does not provide usable scientific tools.
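To make the contrast concrete, here is a minimal sketch in Python (the code and names are ours, offered only as an illustration) that answers the three-dice question both ways: by enumerating the sample space, as Galileo did, and by the gamblers' route of repeated throws.

    import random
    from itertools import product

    # Galileo's route: enumerate all 6 x 6 x 6 = 216 permutations of three dice
    totals = [sum(dice) for dice in product(range(1, 7), repeat=3)]
    print(totals.count(9), totals.count(10))    # 25 permutations sum to 9, 27 sum to 10

    # The gamblers' route: repeated trials
    trials = 100_000
    nines = tens = 0
    for _ in range(trials):
        total = sum(random.randint(1, 6) for _ in range(3))
        if total == 9:
            nines += 1
        elif total == 10:
            tens += 1
    print(nines / trials, tens / trials)        # settles near 25/216 and 27/216

With enough repetitions the simulated frequencies approach 25/216 and 27/216 -- the same answer Galileo reached by enumeration, obtained without any enumeration at all.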
THE DEFINITION OF RESAMPLING AGAIN

Let's again define "resampling" briefly. A statistical procedure models a physical process. A resampling method simulates the model with easy-to-manipulate symbols. Either the observed data (in the case of a problem in statistical inference) or a data-generating mechanism such as a die (in the case of a problem in probability) is used to produce new hypothetical samples, whose properties can then be examined. The resampler postulates a universe and then examines how it behaves -- in the case of statistics, comparing the outcomes to a criterion that we choose. A more extended definition and discussion will be found in Chapter III-3.

The same logic applies to problems labeled "probability" and "statistical inference". The only difference is that in probability problems the "model" is known in advance -- say, the model implicit in a deck of poker cards plus a game's rules for dealing and counting the results -- rather than the model being assumed to be best estimated by the observed data, as in resampling statistics. (Every problem in statistical inference contains a problem in probability at its core.)

Now that resampling has become respectable, some statisticians grumble that it is "just" the Monte Carlo method, or "just" simulation. But the body of resampling methods for statistics was not originally set forth under those labels. It is the overall approach -- the propensity to turn first to resampling methods to handle practical problems -- that most clearly distinguishes resampling from conventional statistics, and from the earlier use of Monte Carlo methods. (In the nineteenth century, "simulation techniques were all tied to the normal distribution, and all involved generating errors to be added to a signal", according to historian Stephen Stigler.) In addition, some resampling methods are new in themselves, the result of the basic resample-it tendency of the past quarter century.

RESAMPLING AND STATISTICAL INFERENCE

Chapter II-2 mentioned how John Arbuthnot, doctor to Queen Anne of England, observed that more boys than girls were born in London; the records showed that male births exceeded female births 82 years in a row. Arbuthnot therefore set out to test whether a universe with a 50-50 probability of producing males could plausibly result in 82 successive years with preponderantly male births. He used the multiplication rule of Pascal and Fermat to calculate the probability (1/2)^82, which is extremely small -- roughly 2 in 10^25. "From whence it follows, that it is Art, not Chance, that governs" -- that is, Divine Providence. (His argument is complex and debatable, as statistical inference often is; the mathematics is the easy part, especially when resampling methods are used.)

Please notice that Arbuthnot could have considered the numbers of boys and girls observed in each year, rather than treating each year as a single observation -- an even stronger test because of the vastly greater amount of information it would use. Arbuthnot surely did not analyze the data for any or all of the individual years because the calculus of probability was still in its infancy. Luckily, the test he made was more than powerful enough for his purposes. But if instead of 82 years in a row, only (say) 81 or 61 of the 82 years had shown a preponderance of males, Arbuthnot would have lacked the tools for a test (though he knew the binomial and logarithms). Nowadays one conventionally uses the Gaussian (Normal) approximation to the binomial distribution to produce the desired probability. But that method requires acquaintance with a considerable body of statistical procedure, and it uses a formula that almost no one knows and even fewer can explain intuitively.
Instead, users simply "plug" the data into a table which, because it is an arcane mystery, invites misuse and erroneous conclusions.

The experimental resampling method of the earlier gamblers could easily have given Arbuthnot a satisfactory answer for (say) 61 of 82 years, however. He had in fact likened the situation to a set of 82 coins. He could simply have tossed such a set repeatedly, and found that almost never would as many as 81 -- or even 61 -- heads occur. He could then have rested as secure in his conclusion as with the formulaic assessment of the probability of 82 years in a row. And because of the intuitive clarity of the experimental method, one would not be likely to make a misleading error in such a procedure.

By the grace of the computer, such problems can be handled more conveniently today. The self-explanatory commands in Figure III-1-1 suffice, using the language RESAMPLING STATS and producing the results shown there.

Figure III-1-1

The intellectual advantage of the resampling method is that though it takes repeated samples from the sample space, it does not require that one know the size of the sample space or of a particular subset of it. To estimate with the binomial formula the probability of getting (say) 61 or more male-predominant years out of 82 requires that one calculate the number of permutations of 82 "male" and "female" years, and the number of those permutations that include 61 or more "male" years. In contrast, with a resampling approach one needs to know only the conditions that produce a single trial yielding a male or female outcome. This conceptual difference, which will be discussed at greater length below, is the reason that, compared to conventional methods, resampling is likely to have higher "statistical utility" -- a compound of efficiency plus the chance that the ordinary scientist or decision-maker will use a correct procedure.

VARIETIES OF RESAMPLING METHODS

A resampling test may be constructed for every case of statistical inference -- by definition. Every real-life situation can be modeled by symbols of some sort, and one may experiment with this model to obtain resampling trials. A resampling method should always be appropriate unless there are insufficient data to perform a useful resampling test, in which case a conventional test -- which makes up for the absence of observations with an assumed theoretical distribution such as the Normal or Poisson -- may produce more accurate results if the universe from which the data are selected resembles the chosen theoretical distribution. Exploration of the properties of resampling tests is an active field of research at present.

Chapter III-2 will discuss the various types of resampling methods. For the main tasks in statistical inference -- hypothesis testing and confidence intervals -- the appropriate resampling test often is immediately obvious. For example, if one wishes to inquire whether baseball hitters exhibit behavior that fits the notion of a slump, one may simply produce hits and outs with a random-number generator adjusted to the batting average of a player, and then compare the number of simulated consecutive sequences of either hits or outs with the observed numbers for the player. The procedure is also straightforward for such binomial situations as the Arbuthnot birth-sex case.
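For instance, the Arbuthnot case takes only a few lines. The following Python sketch (the names are ours; it simply parallels the RESAMPLING STATS program of Figure III-1-1) tosses a set of 82 fair "coins" over and over and counts how often 61 or more come up "male-predominant":

    import random

    trials = 10_000
    count = 0
    for _ in range(trials):
        # one trial: 82 "years", each a fair coin -- 1 means a male-predominant year
        males = sum(random.randint(0, 1) for _ in range(82))
        if males >= 61:
            count += 1
    print(count / trials)    # almost always 0: such a lopsided run essentially never occurs by chance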
Two sorts of procedures are especially well-suited to resampling:

1) A sample of the permutations underlying Fisher's "exact" test (confusingly, also called a "randomization" test). This is appropriate when the size of the universe is properly assumed to be fixed, as discussed below.

2) The bootstrap procedure. This is appropriate when the size of the universe is properly assumed not to be fixed.

Let's compare the permutation and bootstrap procedures in the context of a case which might be analyzed either way. The discussion will highlight some of the violent disagreements in the philosophy of statistics which the use of resampling methods frequently brings to the surface -- one of its great benefits.

In the 1960s I studied the price of liquor in the sixteen "monopoly" states (where the state government owns the retail liquor stores) compared to the twenty-six states in which retail liquor stores are privately owned. (Some states were omitted for technical reasons. The situation and the price pattern have changed radically since then.) These were the representative 1961 prices of a fifth of Seagram 7 Crown whiskey in the two sets of states:

16 monopoly states: $4.65, $4.55, $4.11, $4.15, $4.20, $4.55, $3.80, $4.00, $4.19, $4.75, $4.74, $4.50, $4.10, $4.00, $5.05, $4.20

26 private-ownership states: $4.82, $5.29, $4.89, $4.95, $4.55, $4.90, $5.25, $5.30, $4.29, $4.85, $4.54, $4.75, $4.85, $4.85, $4.50, $4.75, $4.79, $4.85, $4.79, $4.95, $4.95, $4.75, $5.20, $5.10, $4.80, $4.29

The economic question that underlay the investigation -- having both theoretical and policy ramifications -- is as follows: Does state ownership affect prices? The empirical question is whether the prices in the two sets of states were systematically different. In statistical terms, we wish to decide whether there was a difference between the groups of states related to their mode of liquor distribution, or whether instead the observed $.49 differential in mean prices might well have occurred by happenstance -- that is, whether the observed pattern could well have been produced by chance variability.

At first I used a resampling permutation test, as follows: Assuming that the entire universe of possible prices consists of the set of prices that were observed, because that is all the information available about the universe, I wrote each of the forty-two observed state prices on a separate card. The shuffled deck simulated a situation in which each state has an equal chance for each price. On the "null hypothesis" that the two groups' prices do not reflect different price-setting mechanisms, but rather differ only by chance, I then examined how often that simulated universe stochastically produces groups with results as different as those observed in 1961. I repeatedly dealt groups of 16 and 26 cards, without replacing the cards, to simulate hypothetical monopoly-state and private-state samples, each time calculating the difference in mean prices.

The probability that the benchmark null-hypothesis universe would produce a difference between groups as large as or larger than that observed in 1961 is estimated by how frequently the mean of the randomly-chosen group of sixteen prices from the simulated state-ownership universe is less than (or equal to) the mean of the actual sixteen state-ownership prices. If the simulated difference between the randomly-chosen groups was frequently equal to or greater than the difference observed in 1961, one would not conclude that the observed difference was due to the type of retailing system, because it could well have been due to chance variation.
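For readers who prefer a general-purpose language, here is a rough Python rendering of the same shuffling procedure (a sketch using only the standard library; the variable names are ours and the prices are those listed above):

    import random

    monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
                4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
    private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
               4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
               4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

    observed = sum(private) / 26 - sum(monopoly) / 16    # about $0.49

    deck = monopoly + private          # the 42 "cards" of the null-hypothesis universe
    trials, hits = 10_000, 0
    for _ in range(trials):
        random.shuffle(deck)           # shuffle and deal into two groups, without replacement
        sim_private, sim_monopoly = deck[:26], deck[26:]
        diff = sum(sim_private) / 26 - sum(sim_monopoly) / 16
        if diff >= observed:           # a difference as large as the 1961 difference?
            hits += 1
    print(observed, hits / trials)

Replacing the shuffle-and-deal step with draws made from the combined deck with replacement would turn this into the bootstrap variant mentioned below; in this data set the two versions give essentially the same verdict.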
The computer program in Figure III-1-2, written in the language RESAMPLING STATS, performs the operations described above (MATHEMATICA and APL could be used in much the same fashion).

Figure III-1-2

The results shown -- not even one "success" in 10,000 trials -- imply a very small probability that two groups with mean prices as different as those observed would arise by chance if drawn from the universe of 42 observed prices. So we "reject the null hypothesis" and instead find persuasive the proposition that the type of liquor distribution system influences the prices that consumers pay.

As I shall discuss later, the logical framework of this resampling version of the permutation test differs greatly from the formulaic version, which would have required heavy computation. The standard conventional alternative would be a Student's t-test, in which the user simply plugs into an unintuitive formula and table. And because of the unequal numbers of cases and unequal dispersions in the two samples, an appropriate t test is far from obvious, whereas resampling is not made more difficult by such realistic complications.

A program to handle the liquor problem with an infinite-universe bootstrap distribution simply substitutes the random sampling command GENERATE for the TAKE command in Figure III-1-2. The results of the new test are indistinguishable from those in Figure III-1-2.

INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL<1>

Let's now consider one of the simplest numerical examples of probabilistic-statistical reasoning given toward the front of a standard book on medical statistics (Kahn and Sempos, 1989). Using data from the Framingham study, the authors ask: What is an appropriate "confidence interval" on the observed ratio of "relative risk" (a measure defined below; it is closely related to the odds ratio) of the development of myocardial infarction 16 years after the study began, for men ages 35-44 with serum cholesterol above 250, relative to those with serum cholesterol below 250? The raw data are shown in Table III-1-1 (divided into "high" and "low" cholesterol by Kahn and Sempos).

Table III-1-1

Hypothesis Tests With Counted Data

Consider this classic question about the Framingham serum cholesterol data: What is the degree of surety that there is a difference in myocardial infarction rates between the high- and low-cholesterol groups?

The statistical logic begins by asking: How likely is it that the two observed groups "really" came from the same "population" with respect to infarction rates? Operationally, we address this issue by asking how likely it is that two groups as different in disease rates as the observed groups would be produced by the same "statistical universe".

Key step: we assume that the relevant "benchmark" or "null-hypothesis" population (universe) is the composite of the two observed groups. That is, if there were no "true" difference in infarction rates between the two serum-cholesterol groups, and the observed disease differences occurred just because of sampling variation, the most reasonable representation of the population from which they came is the composite of the two observed groups.

Therefore, we compose a hypothetical "benchmark" universe containing (135 + 470 =) 605 men at risk, and designate (10 + 21 =) 31 of them as infarction cases. We want to determine how likely it is that a universe like this one would produce -- just by chance -- two groups that differ as much as do the actually observed groups.
That is, how often would random sampling from this universe produce one sub-sample of 135 men containing a large enough number of infarctions, and the other sub-sample of 470 men few enough infarctions, that the difference in occurrence rates would be as high as the observed difference of .029? (10/135 = .074, and 21/470 = .045, and .074 - .045 = .029.)

So far, everything that has been said applies both to the conventional formulaic method and to the "new statistics" resampling method. But the logic is seldom explained to the reader of a piece of research -- if indeed the researcher her/himself grasps what the formula is doing. And if one just grabs for a formula with a prayer that it is the right one, one need never analyze the statistical logic of the problem at hand.

Now we tackle this problem with a method that you would think of yourself if you began with the following mind-set: How can I simulate the mechanism whose operation I wish to understand? These steps will do the job:

1. Fill an urn with 605 balls, 31 red (infarction) and the rest (605 - 31 = 574) green (no infarction).

2. Draw a sample of 135 (simulating the high serum-cholesterol group), one ball at a time, throwing each ball back after it is drawn to keep the simulated probability of an infarction the same throughout the sample; record the number of reds. Then do the same with another sample of 470 (the low serum-cholesterol group).

3. Calculate the difference in infarction rates for the two simulated groups, and compare it to the actual difference of .029; if the simulated difference is that large, record "Yes" for this trial; if not, record "No".

4. Repeat steps 2 and 3 until a total of (say) 400 or 1000 trials have been completed. Compute the frequency with which the simulated groups produce a difference as great as that actually observed. This frequency is an estimate of the probability that a difference as great as that actually observed in Framingham would occur even if serum cholesterol has no effect upon myocardial infarction.

The procedure above can be carried out with balls in a ceramic urn in a few hours. Yet it is natural to seek the added convenience of the computer to draw the samples. Therefore, we illustrate in Figure III-1-4 how a simple computer program handles this problem. We use RESAMPLING STATS, but the program can be executed in other languages as well, though usually with more complexity and less clarity.

Figure III-1-4

The results of the test using this program may be seen in the histogram in Figure III-1-4. We find -- perhaps surprisingly -- that a difference as large as that observed would occur by chance fully 10 percent of the time. (If we were not guided by the theoretical expectation that high serum cholesterol produces heart disease, we might also count the roughly 10 percent chance of a difference that large in the other direction, giving about a 20 percent chance in all.) Even a ten percent chance is sufficient to call strongly into question the conclusion that high serum cholesterol is dangerous. At a minimum, this statistical result should call for more research before taking any strong action clinically or otherwise.
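For comparison, a rough Python counterpart to the urn procedure of Figure III-1-4 might read as follows (a sketch; the variable names are ours):

    import random

    urn = [1] * 31 + [0] * 574        # 31 infarctions among the 605 men at risk
    observed = 10 / 135 - 21 / 470    # the observed difference, about .029

    trials, hits = 1_000, 0
    for _ in range(trials):
        high = [random.choice(urn) for _ in range(135)]   # draw with replacement
        low = [random.choice(urn) for _ in range(470)]
        diff = sum(high) / 135 - sum(low) / 470
        if diff >= observed:
            hits += 1
    print(hits / trials)    # roughly 0.10, in line with the result reported in Figure III-1-4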
Where should one look to determine which procedures should be used to deal with a problem such as the one set forth above? Unlike the formulaic approach, the basic source is not a manual which sets forth a menu of formulas together with sets of rules about when they are appropriate. Rather, you consult your own understanding about what it is that is happening in (say) the Framingham situation, and the question that needs to be answered, and then you construct a "model" that is as faithful to the facts as possible. The urn-sampling described above is such a model for the case at hand.

To connect what we have done with the conventional approach, we apply a z test (conceptually similar to the t test, but applicable to yes-no data; it is the Normal-distribution approximation to the binomial distribution for large samples) and we find that the results are much the same as the resampling result -- an eleven percent probability.

Someone may ask: Why do a resampling test when you can use a standard device such as a z or t test? The great advantage of resampling is that it avoids using the wrong method. The researcher is more likely to arrive at sound conclusions with resampling because s/he can understand what s/he is doing, instead of blindly grabbing a formula which may be in error.

The textbook from which the problem is drawn is an excellent one; the difficulty of its presentation is an inescapable consequence of the formulaic approach to probability and statistics. The body of complex algebra and tables that only a rare expert understands down to the foundations constitutes an impenetrable wall to understanding. Yet without such understanding, there can be only rote practice, which leads to frustration and error.

Confidence Intervals for Counted Data

So far we have discussed the interpretation of sample data for testing hypotheses. The devices used for the other main theme in statistical inference -- the estimation of confidence intervals -- are much the same as those used for testing hypotheses. Indeed, the bootstrap method discussed above was originally devised for the estimation of confidence intervals. The bootstrap method may also be used to calculate the appropriate sample size for experiments and surveys, another important topic in statistics.

Consider for now just the data for the sub-group of 135 high-cholesterol men. A second classic statistical question is as follows: How much confidence should we have that, if we were to take a much larger sample than was actually obtained, its sample mean (that is, its proportion of infarctions) would be in some close vicinity of the observed sample mean of 10/135 = .074? Let us first carry out a resampling procedure to answer the question, waiting until afterwards to discuss the logic of the inference.

1. Construct an urn containing 135 balls -- 10 red (infarction) and 125 green (no infarction) -- to simulate the universe as we guess it to be.

2. Mix, choose a ball, record its color, replace it, and repeat 135 times (to simulate a sample of 135 men).

3. Record the number of red balls among the 135 balls drawn.

4. Repeat steps 2 and 3 perhaps 1000 times, and observe how much the number of reds varies from sample to sample.

We arbitrarily denote the boundary lines that include 47.5 percent of the hypothetical samples on each side of the sample mean as the 95 percent "confidence limits" around the mean of the actual population. Figure III-1-5 shows how this can be done easily on the computer, together with the results.

Figure III-1-5
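In Python, the same urn might be sketched as follows (our names; the percentile cut-off follows the 2.5/97.5 rule just described):

    import random

    urn = [1] * 10 + [0] * 125        # the high-cholesterol sample treated as a proxy universe
    results = []
    for _ in range(1_000):
        resample = [random.choice(urn) for _ in range(135)]   # draw 135 with replacement
        results.append(sum(resample) / 135)                   # proportion with infarction

    results.sort()
    low, high = results[25], results[974]   # roughly the 2.5th and 97.5th percentiles of 1000 trials
    print(low, high)                        # in the neighborhood of .04 to .12, as in Figure III-1-5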
The variation in the histogram in Figure III-1-5 highlights the fact that a sample containing only 10 cases of infarction is very small; the number of observed cases -- or the proportion of cases -- necessarily varies greatly from sample to sample. Perhaps the most important implication of this statistical analysis, then, is that we badly need to collect additional data.

This is a classic problem in confidence intervals, found in all subject fields. For example, at the beginning of the first chapter of a best-selling book in business statistics, Wonnacott and Wonnacott use the example of a 1988 presidential poll. The language used in the cholesterol-infarction example above is exactly the same as the language used for the Bush-Dukakis poll, except for labels and numbers. Typically, too, the text gives a formula without explaining it, and says that it is "fully derived" eight chapters later (Wonnacott and Wonnacott, 1990, p. 5). With resampling, one never needs such a formula, and never needs to defer the explanation.

The philosophic logic of confidence intervals is quite deep and controversial, less obvious than for the hypothesis test. The key idea is that we can estimate for any given universe the probability P that a sample's mean will fall within any given distance D of the universe's mean; we then turn this around and assume that if we know the sample mean, the probability is P that the universe mean is within distance D of it. This inversion is more slippery than it may seem. But the logic is exactly the same for the formulaic method and for resampling. The only difference is how one estimates the probabilities -- either with a numerical resampling simulation (as here), or with a formula or other deductive mathematical device (such as counting and partitioning all the possibilities, as Galileo did when he answered the gamblers' question about three dice).

And when one uses the resampling method, the probabilistic calculations are the least demanding part of the work. One then has mental capacity available to focus on the crucial part of the job -- framing the original question soundly, choosing a model for the facts so as to properly resemble the actual situation, and drawing appropriate inferences from the simulation.

If you have understood the general logic of the procedures used up until this point, you are in command of all the conceptual knowledge necessary to construct your own tests to answer any statistical question. A lot more practice, working on a variety of problems, obviously would help. But the key elements are simple: 1) model the real situation accurately, 2) experiment with the model, and 3) compare the results of the model with the observed results.

Confidence Intervals on Relative Risk With Resampling

Now we are ready to calculate -- with full understanding -- the confidence intervals on relative risk that the Kahn-Sempos text sought. Recall that the observed sample of 135 high-cholesterol men had 10 infarctions (a proportion of .074), and the sample of 470 low-cholesterol men had 21 infarctions (a proportion of .045). We estimate the relative risk of high cholesterol as .074/.045.

Let us frame the question this way: If we were to randomly draw a sample from the universe of high-cholesterol men that is best estimated from our data (an infarction rate of .074), and a sample from the universe of low-cholesterol men (an infarction rate of .045), and do this again and again, within which bounds would the relative risk calculated from that simulation fall (say) 95 percent of the time? The operation is much the same as that for the single confidence interval estimated above, except that we do the operation for both sub-samples at once and then calculate the ratio of their results to determine the relative risk.
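Sketched in Python (our names again; the "substitute universe" logic behind the procedure is spelled out in the next paragraph), the two-sample operation looks like this:

    import random

    high = [1] * 10 + [0] * 125       # the observed high-cholesterol sample
    low = [1] * 21 + [0] * 449        # the observed low-cholesterol sample

    ratios = []
    for _ in range(1_000):
        # resample each group, with replacement, at its original size
        # (a zero infarction count in the low group is possible in principle but vanishingly rare)
        hi = [random.choice(high) for _ in range(135)]
        lo = [random.choice(low) for _ in range(470)]
        ratios.append((sum(hi) / 135) / (sum(lo) / 470))

    ratios.sort()
    print(ratios[25], ratios[974])    # roughly the 95 percent interval, about 0.7 to 3.4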
As before, we would like to know what would happen if we could take additional samples from the universes that spawned our actual samples. Lacking the resources to do so, we let those original samples "stand in" for the universes from which they came, serving as proxy "substitute universes." It is as if we replicated each sample element a million times and then took "bootstrap" samples from this "proxy universe." Paralleling the real world, we take simulated samples of the same size as our original samples. (In practice we need not replicate each sample element a million times; we achieve the same effect by sampling with replacement from our original samples -- that way, the chance that a given sample element will be drawn remains the same from draw to draw.) We count the number of infarctions in each of our resamples, calculate the relative risk measure for the pair of resamples, and keep score of the result. We repeat with additional pairs of resamples, each time calculating the relative risk measure, and then examine the overall results.

We may compare our results in Figure III-1-6 -- a confidence interval extending from 0.69 to 3.4 -- to the results given by Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to 3.7 from three different formulas (pp. 62-63); the agreement is close.

Figure III-1-6

OTHER RESAMPLING TECHNIQUES

We have so far seen examples of three of the most common resampling methods -- binomial, permutation, and bootstrap. These methods may be extended to handle correlation, regression, and tests where there are three or more groups. Indeed, resampling can be used for every other statistic in which one may be interested -- for example, statistics based on absolute deviations rather than squared deviations. This flexibility is a great virtue because it frees the statistics user from the limited and oft-confining battery of textbook methods.

SOME OTHER ILLUSTRATIONS

A Measured-Data Example: Test of a Drug to Prevent Low Birthweight

The Framingham infarction-cholesterol examples worked with yes-no "count" data. Let us therefore consider some illustrations of the use of resampling with measured data. Another leading textbook (Rosner, 1982, p. 257) gives the example of a test of the hypothesis that drug A prevents low birthweights. The data for the treatment and control groups are shown in Table III-1-2. The treatment group averaged .82 pounds more than the control group. Here is a resampling approach to the problem:

Table III-1-2

1. If the drug has no effect, our best guess about the "universe" of birthweights is that it is composed of (say) a million each of the observed weights, all lumped together. In other words, in the absence of any other information or compelling theory, we assume that the combination of our samples is our best estimate of the universe. Hence let us write each of the birthweights on a card and put them into a hat. Drawing them one by one and then replacing them is the operational equivalent of a very large (but equal) number of each birthweight.

2. Repeatedly draw two samples of 15 birthweights each, and check how frequently the simulated difference between the two is as large as, or larger than, the actual difference of .82 pounds.

We find in Figure III-1-7 that only about 1 percent of the pairs of hypothetical resamples produced means that differed by as much as .82. We therefore conclude that the observed difference is unlikely to have occurred by chance.

Figure III-1-7
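The same test can be sketched in Python (our names; the weights are those of Table III-1-2):

    import random

    treat = [6.9, 7.6, 7.3, 7.6, 6.8, 7.2, 8.0, 5.5, 5.8, 7.3, 8.2, 6.9, 6.8, 5.7, 8.6]
    control = [6.4, 6.7, 5.4, 8.2, 5.3, 6.6, 5.8, 5.7, 6.2, 7.1, 7.0, 6.9, 5.6, 4.2, 6.8]
    combined = treat + control        # the null-hypothesis universe of 30 birthweights

    trials, hits = 1_000, 0
    for _ in range(trials):
        a = [random.choice(combined) for _ in range(15)]   # draw with replacement
        b = [random.choice(combined) for _ in range(15)]
        if sum(a) / 15 - sum(b) / 15 >= 0.82:
            hits += 1
    print(hits / trials)    # on the order of 1 percent, as reported in Figure III-1-7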
Matched-Patients Test of Three Treatments

There have been several recent three-way tests of treatments for depression: drug versus cognitive therapy versus combined drug and cognitive therapy. Consider this procedure for a proposed test in 31 triplets of people who have been matched within triplet by sex, age, and years of education. The three treatments are to be assigned randomly within each triplet.

Assume that the outcomes on a series of tests were ranked from best (#1) to worst (#3) within each triplet, and assume that the combined drug-and-therapy regime has the best (that is, the lowest) average rank. How sure can we be that the observed result would not occur by chance? In the hypothetical Table III-1-3 the average rank for the drug-and-therapy regime is 1.74. Is it likely that the regimes do not "really" differ with respect to effectiveness, and that the drug-and-therapy regime came out with the best rank just by the luck of the draw? We test by asking, "If there is no difference, what is the probability that the treatment of interest will get an average rank this good, just by chance?"

Table III-1-3

Figure III-1-8 shows a program for a resampling procedure that repeatedly produces 31 ranks randomly selected from the numbers 1, 2, and 3, and then averages the ranks. We can then observe whether an average as low as 1.74 is unusual, and hence should not be ascribed to chance.

Figure III-1-8

In 1000 repetitions of the simulation (10,000 would take just a few moments longer), 5% yielded average ranks as low as the observed value. This is evidence that something besides chance might be at work here. (The result is at the borderline of the traditional 5% "level of significance" -- a p-value of .05 -- supposedly set arbitrarily by the great biostatistician R. A. Fisher on the grounds that a 1-in-20 happening is too coincidental to ignore.) That is, the resampling test suggests that it would be unlikely for a given treatment regime to achieve, just by chance, results as good as those actually observed.

An interesting feature of the treatment problem is that it would be hard to find a conventional test that would handle this three-way comparison in an efficient manner. Certainly it would be impossible to find a test not requiring formulae and tables that only a talented professional statistician could manage satisfactorily, and even s/he is not likely to fully understand those formulaic procedures.
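A Python sketch of the rank simulation (our names; 1.74 is the observed average rank of the drug-and-therapy group in Table III-1-3):

    import random

    trials, hits = 1_000, 0
    for _ in range(trials):
        # under the null hypothesis, each triplet gives the drug-and-therapy group
        # a rank of 1, 2 or 3 purely at random
        ranks = [random.choice((1, 2, 3)) for _ in range(31)]
        if sum(ranks) / 31 <= 1.74:
            hits += 1
    print(hits / trials)    # around 5 percent, as in the run reported above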
THE COMPUTER AND RESAMPLING

Some now refer to resampling as "computer-intensive statistics" (e.g., Noreen, 1986). And others have written that resampling had to await the easy availability of computers. It is arguable, however, whether the computer is a crucial element of the resampling method. Resampling operations (including the bootstrap and permutation procedures) can often be conducted quite satisfactorily with simpler tools. Indeed, the permutation test for the liquor-prices example above was done by hand with cards, and a bootstrap test could similarly have been done. Nevertheless, the inconvenience of doing tests by hand was a barrier to implementation and adoption. Therefore, in the early 1970s Dan Weidenfeld and I developed a computer language and a program for the mainframe that carries out resampling operations (including permutation tests, the bootstrap, and just about every other device) more expeditiously than simpler tools such as coins, dice, and random-number tables (Simon and Weidenfeld, 1974); this is the same language that today is marketed for the personal computer under the name Resampling Stats. More about the computer and resampling in the next chapter.

ON THE NATURE OF RESAMPLING TESTS

Resampling is a much simpler intellectual task than the formulaic method because simulation obviates the need to calculate the number of points in the entire sample space. This subject is explored in Chapter 00.

REPEAT 1000               Repeat 1000 trials
GENERATE 82 1,2 A         Generate randomly 82 1s (males) or 2s
COUNT A =1 B              Count the males
SCORE B Z                 Keep score of trial results
END
HISTOGRAM Z
COUNT Z >= 61 K           Count the trials with 61 or more "male" years

[Histogram of the trial results: frequency of the number of "male-predominant" years per trial, ranging from roughly 20 to 60.]

K = 0

Figure III-1-1: Test of the Arbuthnot Birth-Sex Data

NUMBERS (482 529 489 495 455 490 525 530 429 485 454 475 485 485 450 475 479 485 479 495 495 475 520 510 480 429) A    The 26 private-state prices, in cents
NUMBERS (465 455 411 415 420 455 380 400 419 475 474 450 410 400 505 420) B    The 16 monopoly-state prices, in cents
CONCAT A B C              Join the two vectors of data
REPEAT 1000               Repeat 1000 simulation trials
SHUFFLE C D               Shuffle the 42 state prices
TAKE D 1,26 E             Take 26 for the "private" group
TAKE D 27,42 F            Take the other 16 for the "monopoly" group
MEAN E EE                 Find the mean of the "private" group
MEAN F FF                 Mean of the "monopoly" group
SUBTRACT EE FF G          Difference in the means
SCORE G Z                 Keep score of the trials
END
HISTOGRAM Z               Graph of simulation results to compare with the observed result

[Histogram of the trial results: frequency of the difference in average prices (cents), ranging from roughly -40 to +40. Actual difference = $0.49.]

Figure III-1-2: Permutation Test of the Liquor-Price Difference

URN 31#1 574#2 men        An urn called "men" with 31 "1s" (= infarctions) and 574 "2s" (= no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men high       Sample (with replacement!) 135 of the numbers in this urn; give this group the name "high"
SAMPLE 470 men low        Same for a group of 470; call it "low"
COUNT high =1 a           Count infarctions in the first group
DIVIDE a 135 aa           Express as a proportion
COUNT low =1 b            Count infarctions in the second group
DIVIDE b 470 bb           Express as a proportion
SUBTRACT aa bb c          Find the difference in infarction rates
SCORE c z                 Keep score of this difference
END
HISTOGRAM z
COUNT z >=.029 k          How often was the resampled difference >= the observed difference?
DIVIDE k 1000 kk          Convert this result to a proportion
PRINT kk

[Histogram of the trial results: frequency of the difference between paired resamples (proportion with infarction), ranging from roughly -0.1 to +0.1.]

kk = 0.102 (the proportion of resample pairs with a difference >= .029)

Figure III-1-4: Test for Differences in Infarctions

URN 10#1 125#0 men        An urn (called "men") with ten "1s" (infarctions) and 125 "0s" (no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men a          Sample (with replacement) 135 numbers from the urn and put them in "a"
COUNT a =1 b              Count the infarctions
DIVIDE b 135 c            Express as a proportion
SCORE c z                 Keep score of the result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of all trial results
PERCENTILE z (2.5 97.5) k Determine the 2.5th and 97.5th percentiles of all trial results; these points enclose 95% of the results
PRINT k

[Histogram of the trial results: frequency of the proportion with infarction, ranging from roughly 0 to 0.2.]

k = 0.037037 0.11852 (this is the 95% confidence interval, enclosing 95% of the resample results)

Figure III-1-5: Confidence Interval Around the Sample Proportion

URN 10#1 125#0 high       The universe of 135 high-cholesterol men, 10 of whom ("1s") have infarctions
URN 21#1 449#0 low        The universe of 470 low-cholesterol men, 21 of whom ("1s") have infarctions
REPEAT 1000               Repeat the steps that follow 1000 times
SAMPLE 135 high high$     Sample 135 (with replacement) from the high-cholesterol universe and put them in "high$" (the "$" suffix just indicates a resampled counterpart to the actual sample)
SAMPLE 470 low low$       Similarly for 470 from the low-cholesterol universe
COUNT high$ =1 a          Count the infarctions in the first resampled group
DIVIDE a 135 aa           Convert to a proportion
COUNT low$ =1 b           Count the infarctions in the second resampled group
DIVIDE b 470 bb           Convert to a proportion
DIVIDE aa bb c            Divide the proportions to calculate the relative risk
SCORE c z                 Keep score of this result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of trial results
PERCENTILE z (2.5 97.5) k Find the percentiles that bound 95% of the trial results
PRINT k

[Histogram of the trial results: frequency of the relative risk, ranging from roughly 0 to 6.]

Results (estimated 95% confidence interval): k = 0.68507 3.3944

Figure III-1-6: Confidence Interval Around Relative Risk

NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7 8.6) treat      The treatment-group birthweights
NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2 6.8) control    The control-group birthweights
CONCAT treat control all  Combine all birthweight observations in the same vector
REPEAT 1000               Do 1000 simulations
SAMPLE 15 all treat$      Take a resample of 15 from all birthweights (the "$" indicates a resampling counterpart to a real sample)
SAMPLE 15 all control$    Take a second, similar resample
MEAN treat$ mt            Find the means of the two resamples
MEAN control$ mc
SUBTRACT mt mc dif        Find the difference between the means of the two resamples
SCORE dif z               Keep score of the result
END                       End the simulation experiment, go back and repeat
HISTOGRAM z               Produce a histogram of the resample differences
COUNT z >= .82 k          How often did the resample differences exceed the observed difference of .82?

[Histogram of the trial results: frequency of the resample differences in pounds, ranging from roughly -1.5 to +1.5.]

Result: Only 1.3% of the pairs of resamples produced means that differed by as much as .82. We can conclude that the observed difference is unlikely to have occurred by chance.

Figure III-1-7: Test for Birthweight Differences

REPEAT 1000               Do 1000 simulations
GENERATE 31 (1 2 3) ranks Generate 31 numbers, each a "1", "2" or "3", to simulate random assignment of ranks 1-3 to the drug/therapy alternative
MEAN ranks rankmean       Take the mean of these 31 ranks
SCORE rankmean z          Keep score of the mean
END                       End the simulation, go back and repeat
HISTOGRAM z               Produce a histogram of the rank means
COUNT z <=1.74 k          How often is the mean rank as good as (<=) 1.74, the observed value?
PRINT k

[Histogram of the trial results: frequency of the mean rank per trial, ranging from roughly 1.4 to 2.6.]

Figure III-1-8: Test for Improvement by Combined Depression Therapy

Table III-1-1
Development of Myocardial Infarction in Framingham After 16 Years,
Men Age 35-44, by Level of Serum Cholesterol

Serum cholesterol (mg%)    Developed MI    Did not develop MI    Total
> 250                           10                125             135
<= 250                          21                449             470

Source: Shurtleff, D., The Framingham Study: An Epidemiologic Investigation of Cardiovascular Disease, Section 26. Washington, DC: U.S. Government Printing Office. Cited in Kahn and Sempos (1989), p. 61, Table 3-8.

Table III-1-2
Birthweights in a Clinical Trial to Test a Drug for Preventing Low Birthweights

                 Baby weight (lb)
Patient    Treatment group    Control group
   1             6.9               6.4
   2             7.6               6.7
   3             7.3               5.4
   4             7.6               8.2
   5             6.8               5.3
   6             7.2               6.6
   7             8.0               5.8
   8             5.5               5.7
   9             5.8               6.2
  10             7.3               7.1
  11             8.2               7.0
  12             6.9               6.9
  13             6.8               5.6
  14             5.7               4.2
  15             8.6               6.8

Source: Rosner, Table 8.7.

Table III-1-3
Observed Rank of Depression Treatments, by Effectiveness (Hypothetical)

                          Treatment
Triplet group     Drug    Therapy    Drug/Therapy
     1              3        1            2
     2              2        3            1
     3              1        3            2
     .              .        .            .
     .              .        .            .
     .              .        .            .
    31              2        1            3
Avg. rank          2.29     1.98         1.74

ENDNOTES

<1>: The following biostatistical examples are joint work with Peter C. Bruce.