CHAPTER III-2

BACKGROUND AND ANALYSIS OF THE RESAMPLING METHOD

INTRODUCTION

The term "resampling" has been applied to a variety of techniques for statistical inference, among which stochastic permutation and the bootstrap are the most characteristic. Resampling methods are evolving rapidly, and their scopes and interrelationships are not always clear. The aim of this chapter, therefore, is to distinguish the various techniques falling under the resampling rubric from other related techniques, in order to aid discussion of the set of methods.

There are two domains corresponding to the term "resampling". The wider domain includes all uses of simulation techniques for statistical inference (though not uses of simulation for the development of other techniques). The narrower sub-domain includes only simulation techniques that re-use the observed data to constitute a universe from which to draw repeated samples (without replacement = permutation techniques; with replacement = bootstrap techniques). The narrower, quintessential domain may also be defined as follows: those techniques that a) use the sample data in their entirety (though not necessarily with replacement) to repeatedly produce hypothetical samples, either by drawing subsamples stochastically or by rearranging the original observations stochastically, and then b) compare the results of those simulation samples to the observed sample. It will not always be easy to keep these two domains - or domain and sub-domain - clearly distinguished. Additionally, I argue that it is often useful to extend the term "resampling" beyond statistical inference and into probability, where one generates the simulation samples from a known device rather than from an unknown universe estimated by the observed data, because the mathematical simulation processes are identical in probability and statistics.
Because resampling is still in its early stages, there is little consensus about its definitions as well as its practices, which means that the discussion will inevitably have many loose ends and be open to many rebuttals. But I hope that the reader will view the vigor and yeastiness of the controversy as indicating that this is the beginning of a discussion of important issues, rather than concluding that the discussion that follows is unsatisfactory because it is subject to so many criticisms. (Indeed, absence of loose ends typically indicates that a topic is so settled that no further discussion is needed.) The appropriate question, as I see it, is not whether there are flaws in the discussion to follow, but rather whether the issues deserve to be aired in public where they can be thrashed out.

The next section discusses the intellectual paths that have led to the general resampling method. The following section provides a classification of resampling methods and discusses their characteristics. After that comes a section of comment.

ROADS TO RESAMPLING

Several quite different intellectual roads have led to the body of methods called "resampling" as of 1996. The fact that such different roads lead to the same place may be considered empirical evidence for the inevitability of the general approach to inferential statistics, even before high-speed computers became commonplace and cheap.

Dwass, and Chung-Fraser: Approximation of Classical Method

The first publication of any of the techniques that now make up the resampling kit bag was by Meyer Dwass in 1957, and by J. H. Chung and D. A. S. Fraser in 1958. Both papers pointed to the value of Fisher's permutation test (1935; see also Pitman, 1937, pp. 322-335, for advances in the direction begun by Fisher), misleadingly called the "randomization" technique.
Both noted that with a large sample the "exact" Fisher test is not feasible because of the computational difficulty (before the age of powerful computers). They then suggested that a randomly generated subset of the possible permutations could provide the benefits of the permutation test without excessive computational cost. The underlying idea was to use the power of sampling, in a fashion similar to the way it is used in empirical samples from large universes of data, in order to approximate the ideal test based on the complete set of permutations. And they showed that the approximation would be quite satisfactory. So their vision was to gain the benefits of the classical array of methods - though not a parametric test in this case - by the technical device of simulation sampling. Unfortunately, this technical trick does not work in the case of parametric tests themselves, because there is no way to use the device of sampling to replace the formulaic basis and the tabular superstructure of such methods as the Normal-based z test or the t-test (though the stochastic permutation test may be seen as a substitute for the t-test and hence a way of evading its use). Therefore, the path opened by Dwass and Chung-Fraser does not immediately broaden out intellectually into the resampling highway, though Draper and Stoneman built on the earlier work with an application of the permutation test to regression (1966). The first method used for comparison of liquor prices in Chapter III-1 is a permutation test. It should be noted that the stochastic permutation test is one of the two central resampling techniques; it is true "resampling" in the sense of treating the observed data as the best guess about the nature of the universe of interest, and then re-using those data as the basis for experimental samples.
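The Dwass and Chung-Fraser proposal - draw a random subset of the possible rearrangements rather than enumerating them all - can be sketched in a few lines of modern code. The following is a minimal illustration only, with invented numbers; the original papers of course predate any such language:

```python
import random

def permutation_test(group_a, group_b, trials=10_000, seed=0):
    """Stochastic permutation test for a difference in means.

    Instead of enumerating every division of the pooled data
    (Fisher's exact approach), shuffle the pooled data - i.e.
    rearrange without replacement - a random subset of times.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)                      # rearrange without replacement
        diff = (sum(pooled[:n_a]) / n_a
                - sum(pooled[n_a:]) / len(group_b))
        if abs(diff) >= abs(observed):           # two-sided comparison
            extreme += 1
    return extreme / trials                      # estimated p-value

# Hypothetical two-group prices, for illustration only
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90]
state = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00]
p = permutation_test(private, state)
```

The estimated p-value converges on the "exact" permutation p-value as the number of trials grows, which is precisely the sense in which these authors viewed the stochastic test as an approximation of the ideal test.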
It should also be noted that stochastic permutation and the bootstrap are identical except for whether or not the samples are taken with replacement, and the two methods converge toward the same result as observed sample size increases; in many applications it is difficult or impossible to establish a clear philosophic justification for the use of one or the other technique. There is one conceptual difference, however: the stochastic permutation test may plausibly be seen as a sampling "approximation" to an "exact" technique; no such notion is possible for the bootstrap, so the latter is even more distant from the conventional approach than is the former. The inherent sensibleness of a stochastic permutation test is evidenced by its independent simultaneous discovery and also by its later re-discovery by me (1969) and by Feinstein (1973). Additional evidence of this is the independent re-discovery of the same stochastic permutation principle, in the somewhat different context of tests of significance with survival data, in 1970 by Forsythe and Frey. The foregoing writers viewed a stochastic simulation as a less exact approximation of the ideal test. There was no mention that the essential nature of simulation differs from formulaic tests in not requiring counting the points in the sample space, the central element in probability theory.

Barnard's Test for the Fit to a Distribution

In a few brief paragraphs in a comment in 1963 that is difficult to find even with the citation in hand, Barnard (1963) suggested a simulation test for how well a given sample fits a theoretical distribution - specifically, a runs test based on comparison to the results of drawings from a (horizontal) distribution - and envisioned doing the work with the aid of a computer. As with Dwass and Chung-Fraser, Barnard was offering a simulation technique as an inexact substitute for a formulaic method when the formulaic method is infeasible.
This was stated clearly in a comment by Hope (1968) in the context of further work on Barnard's test. Hope recommended that Monte Carlo tests not be the tool of first resort: "It is preferable to use a known test of good efficiency instead of a Monte Carlo test procedure ..." (Hope, 1968, p. 582). Hope did go further, however, and recommended that a resampling test be used when "the necessary conditions for applying the [conventional] test may not be satisfied, or the underlying distribution may be unknown or it may be difficult to decide on an appropriate test criterion. Also, it is possible that only a physical model can be obtained which cannot be expressed in mathematical terms" (p. 582). But resampling still is seen as a second-best method. [1] Barnard's test is in the penumbra of resampling because it uses an independent device - a coin, or random numbers - to generate the trial simulation samples. It may be viewed either as a third category intermediate between formulaic methods and core resampling, or as a member of the larger resampling domain but not of the core sub-domain.

An Overall Approach to Inference, and the Bootstrap

In 1967 (Simon, 1969a, 1969b, chapters 23-25; 3rd edition with Paul Burstein, 1985) I developed the resampling method in a very general context, starting with first principles of simulation and statistical inference rather than with any particular formulaic device. I illustrated the general idea and showed its breadth and power with a variety of methods (including the bootstrap and the stochastic permutation test and many others) for a range of problems including hypothesis tests, confidence intervals, fits to distributions, fixing of sample size, and other statistical needs. The intellectual basis was the centuries-old practice of experimentation to learn the odds in gambling games, together with the idea of Monte Carlo simulations of complex physical phenomena at Rand during and after World War II (see Ulam, 1976, pp.
196-199; Metropolis and Ulam, 1949; for their group, Monte Carlo was "a statistical approach to the study of differential equations", a device for dealing with the "completely intractable task...in closed form" [pp. 335, 337, 338], that is, a taking from statistics, whereas for me the method was an approach to statistical practice, a giving to statistics). I referred to the work as "Monte Carlo" and I wrote that it departed from the earlier work of Dwass and Chung-Fraser (seen as examples of the practice of resampling at large) and at Rand in two main ways:

1) I dealt with simple problems where the value of the technique was that persons could arrive at sound solutions that are perfectly understandable, rather than using mysterious formulas that are often wrongly chosen; this included problems in probability as well as statistical inference.

2) Despite illustrating the use of the same general method on probabilistic problems, I focused on the problems of statistical inference rather than those considered in studies of probability, and mapped out the entire range of applied problems in inferential statistics. This emphasis on a new general technique to be used across the board, rather than a single particular device to be used in a particular situation, was the most radical innovation and the aspect of the work that evoked (and still evokes) the most resistance, because it calls into question the existing body of formulaic methods.

When first developing this material I was not aware of the work of Dwass and of Chung and Fraser and hence, like Feinstein after me (1973), I re-invented their idea.[2] I subsequently attributed to them the entire vision of a Monte Carlo approach to statistical inference, though in retrospect it can be seen that their view of the matter was much less general.
When in 1976 I (with Atkinson and Shevokas) published results of controlled experiments showing that persons arrive at more correct answers to basic statistical problems when they are taught and employ resampling methods rather than conventional formulaic methods, I wrote: "It must be emphasized that the Monte Carlo method as described here really is intended as an alternative to conventional analytic methods in actual problem-solving practice...the simple Monte Carlo method described here is complete in itself for handling most - perhaps all - problems in probability and statistics" (1976, p. 734). I believe that that statement, with the vision that it expresses, is the radical departure from previous thought on the practice of statistics. The development of particular techniques is subsidiary. It seems to me that this general statement is the most important element in the resampling approach.

The Bootstrap: Re-sampling with Replacement

The comparison of liquor prices in private-enterprise and state-owned systems discussed in Chapter III-1 can also be done with a process of sampling with replacement rather than permutation; such a test was dubbed the "bootstrap" by Bradley Efron in 1979. It was first published in three examples in Simon (1969b, with further discussion in correspondence with Kruskal, 1969), and was in common use at the University of Illinois in the early 1970s, being the only method used for hypothesis-testing in the 1976 text by Atkinson, Shevokas, and Travers. Recently I have concluded that a bootstrap-type test has better theoretical justification than a permutation test in this case, though the two reach almost identical results with a sample this large. The following discussion of which is most appropriate brings out the underlying natures of the two approaches, and illustrates how resampling raises issues which tend to be buried amidst the technical complexity of the formulaic methods, and hence are seldom discussed in print.
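The bootstrap counterpart of such a two-group comparison can be sketched in the same spirit; the only mechanical change from a stochastic permutation test is that the pooled observations are drawn with replacement. A minimal Python sketch, with invented prices standing in for the actual Chapter III-1 data:

```python
import random

def bootstrap_test(group_a, group_b, trials=10_000, seed=0):
    """Bootstrap test of a difference in means: identical in form
    to the stochastic permutation test, except that the pooled
    data are resampled WITH replacement."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    extreme = 0
    for _ in range(trials):
        # draw n_a + n_b values with replacement from the pooled universe
        sample = [rng.choice(pooled) for _ in range(n_a + n_b)]
        diff = sum(sample[:n_a]) / n_a - sum(sample[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / trials

# Hypothetical stand-in prices, for illustration only
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90]
state = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00]
p_boot = bootstrap_test(private, state)
```

With samples of any substantial size the p-value from this procedure and from the corresponding permutation test nearly coincide, which is the convergence remarked on above.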
Imagine a class of 42 students, 16 men and 26 women, who come into the room and sit in 42 fixed seats. We measure the distance of each seat to the lecturer, and assign each a rank. The women sit in ranks 1-5, 7-20, etc., and the men in ranks 6, 22, 25-26, etc. You ask: Is there a relationship between sex and ranked distance from the front? Here the permutation procedure that resamples without replacement - as used above with the state liquor prices - quite clearly is appropriate. Now, how about if we work with actual distances from the front? If there are only 42 seats and they are fixed, the permutation test and sampling without replacement again is appropriate. But how about if seats are movable? Consider the possible situation in which one student can choose position without reference to others. That is, if the seats are movable, it is not only imaginable that A would be sitting where B now is, with B in A's present seat - as was the case with the fixed chairs - but A could now change distance from the lecturer while all the others remain as they are. Sampling with replacement now is appropriate. (To use a technical term, the cardinal data provide more actual degrees of freedom - more information - than do the ranks.) Note that (as with the liquor prices) the seat distances do not comprise an infinite population. Rather, we are inquiring whether a) the universe should be considered limited to a given number of elements, or b) could be considered expandable without change in the probabilities; the latter is a useful definition of "sampling with replacement". As of 1996, the U.S. state liquor systems seem to me to resemble a non-fixed universe (like non-fixed chairs) even though the actual number of states is presently fixed. The question the research asked was whether the liquor system affects the price of liquor.
We can imagine another state being admitted to the union, or one of the existing states changing its system, and pondering how the choice of system will affect the price. And there is no reason to believe that (at least in the short run) the newly-made choice of system would affect the other states' pricing; hence it makes sense to sample with replacement (and use the bootstrap) even though the number of states clearly is not infinite or greatly expandable. In short, the presence of interaction - a change in one entity causing another entity also to change - implies a finite universe composed of those elements, and use of a permutation test. Conversely, when one entity can change independently, an infinite universe and sampling with replacement with a bootstrap test is indicated.

Efron's Route to the Bootstrap and Development of It

Efron connects his work - at first, his apparent rediscovery of the bootstrap and later his wider applications of resampling - to the jackknife. "Historically the subject begins with the Quenouille-Tukey jackknife, which is where we will begin also" (1979; 1982, p. 1). This connection to the jackknife is immediately obvious in the title of his first work on the subject, and in his discussions of the subject since then. Diaconis and Efron later wrote: "There are close theoretical connections among the methods [cross-validation, jackknife, bootstrap]. One line of thinking develops them all, as well as several others, from the bootstrap" (1983, p. 130). But this statement refers to the logical connections, which are the reverse of Efron's historical process. However, the jackknife (Quenouille, 1956, and Tukey, 1958) and cross-validation (or "sample splitting"; see Mosier, 1951, pp. 5-11) are entirely outside the definitions of resampling, whether narrow or broad.
They are connected with each other and with the bootstrap by a very different line of thinking than the concept of resampling; rather, they share the common aim of inferring reliability. That is, the motivations of Quenouille and Tukey in inventing the jackknife, and of Efron in developing the bootstrap, may have been similar. The natures of the devices are very different, to wit: Cross-validation separates the available data into two or more segments, and then tests the model generated in one segment against the data in the other segment(s); clearly there is no re-use of the same data, nor is there any use of repeated simulation trials. The jackknife - which, as Low (entry in the Encyclopedia of Statistical Science, Kotz & Johnson, v. 8, 1983) notes, "reduce[s] the size of the sample [really, the resampling universe] in each of the re-computations of the statistic" - does not use the data in their entirety for each trial[3], a key characteristic of resampling. That is, the observations omitted from the experimental samples are designated systematically, whereas in resampling (see definition below) observations are probabilistically omitted from the experimental samples. To put it differently, every jackknife analysis of a given set of data produces the same result, unlike resampling processes (assuming no problem with the seed in the random-number generator). The jackknife has in common with the resampling techniques discussed here the partial re-use of data, but it does not a) resample from the data, or b) use all of the data for any given sample. The jackknife has more in common with such scientific practices as examining the results when leaving off the extreme observations in a sample - as is suggested visually by some of Tukey's graphic techniques - than it does with other methods included in the present definition of resampling.
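The deterministic, leave-one-out character of the jackknife described above can be made concrete in a few lines; the numbers are made up for illustration:

```python
def jackknife_means(data):
    """Jackknife recomputation of the mean: each 'sample' omits
    exactly one observation, chosen SYSTEMATICALLY rather than
    probabilistically, so no random-number generator is involved."""
    n = len(data)
    total = sum(data)
    # leave-one-out means: the i-th value uses all data except data[i]
    return [(total - x) / (n - 1) for x in data]

data = [3.0, 5.0, 7.0, 9.0]
run1 = jackknife_means(data)
run2 = jackknife_means(data)
# run1 and run2 are necessarily identical: there is no randomness,
# and no trial ever uses all n observations at once
```

This is the point of the contrast: a bootstrap or stochastic permutation run varies from seed to seed and uses a full-size sample each trial, while every jackknife analysis of the same data yields the same set of leave-one-out recomputations.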
(The jackknife also makes use of the t distribution, which also puts it outside of the basic definitions of resampling as discussed below.) Indeed, though Efron (quoted above) came to the bootstrap and then resampling more generally by way of the jackknife, he notes: "In fact it would be more logical to begin with the bootstrap..." (1982, p. 1). (And indeed, discussion of the jackknife has diminished severely over time in connection with resampling.) But it is even more logical to begin with the general vision of resampling as embodied in the definition given here and in the wide range of techniques shown in my 1969 book. As I read the literature, it has been moving more and more toward that vision. And there is some movement in introductory texts to present resampling techniques as tools of first resort rather than tools to which one should turn only when stymied in the search for formulaic methods. There seems to have been no connection between Efron's development and the concept of Monte Carlo simulation, and simulation in general; the index of Efron and Tibshirani (1993) lists them only with reference to specific practices in a few particular bootstrap applications, with no reference to Stan Ulam (the putative father of the Monte Carlo method and label at Rand). Nor is there connection between Efron's work and that of Dwass and Chung-Fraser; to my knowledge, they are never referred to in his writings (though I have not examined them all). At the end of this survey of the origins of resampling, it is interesting to note that the long tradition of experimental studies of distributions and properties of estimators in statistics and econometrics, with Student being an early distinguished example, did not enter into the thinking of any of the intellectual streams discussed above. Nor did the use of simulation for pedagogical illustration, such as the sampling distribution of the sample mean.
THE CHARACTERISTICS AND THE CLASSIFICATION OF METHODS

The previous section briefly described resampling, giving both a core definition and a wider definition. This section goes into more detail about the characteristics of resampling techniques in contrast to techniques that are outside the resampling domain(s).

Re-use of the Available Data to Generate Repeated Samples

Systematic re-use of the available data is the central characteristic of resampling, and it is at the heart of the following core definition of resampling: Use the observed data in their entirety (though not necessarily with replacement) to repeatedly produce experimental samples, either by drawing subsamples stochastically or by rearranging the original observations stochastically, and then compare the results of those simulation-trial samples to the original sample data. Consider this Efron-Tibshirani definition of the bootstrap: "A bootstrap sample x* = (x1*, ..., xn*) is obtained by randomly sampling n times, with replacement, from the original data points x1, ..., xn". Another similar and very clear statement elsewhere: "Each bootstrap sample has n elements, generated by sampling with replacement n times from the original data set" (p. 13, below their Figure 2.1). If we now amend that definition by writing "with or without replacement", the permutation test is included and the definition is a formal and precise description of core resampling methods.[4] The wider definition of resampling also includes other uses of simulation techniques for statistical inference that generate samples by random drawings from distributions derived other than from the observed data - for example, Barnard's drawings from a horizontal distribution against which to compare an observed sample.
This wider definition would seem to encompass devices to serve all or most of the purposes of statistical inference, and such a wide definition was the basis of the suggestion made explicitly in Simon, Atkinson, and Shevokas (1976) that resampling be thought of as the first option in all situations. If I were to choose a label for the wider domain, I would call it the "best-guess-universe" method. This not only has the virtue of including inferential simulation methods other than those that re-use the data, but the label also points up that when one has a better guess about the universe than just the observed data - when the data are very few, for example, and other information or assumptions should be used in a Bayesian spirit - one should then not be limited to the use of the observed data. One can also broaden the definition of core resampling to include not only problems in probabilistic statistics but also problems in probability, by including the phrase "or the data-generating mechanism (such as a die)" after "observed data" in the definition above. Problems in pure probability may at first seem different in nature from the probabilistic-statistical (inverse probability) problems, and foreign to the concerns of statisticians. But the same logic as stated in the definition above applies to problems in probability as to problems in inferential statistics. The only difference is that in probability problems the model is known in advance - say, the model implicit in a deck of cards plus a game's rules for dealing and counting the results - rather than being inferred from, and best described by, the observed data, as in resampling statistics. Efron has given a definition in the same spirit: "You use the data to estimate probabilities and then you pick yourself up by your bootstraps and see how variable the data are in that framework" (Science, 13 July, 1984, p. 157).
Though Efron focuses upon the variability of a sample statistic in this definition, the centrality of re-use is apparent. (And though he was referring only to the bootstrap technique, this definition obviously applies to permutation tests as well.) It has been noted earlier that the jackknife and cross-validation do not fit the definition of resampling. Nor do other standard closed-form methods in inference.

Non-Use of the Gaussian Distribution

The non-use of the Normal distribution is another of the central characteristics of resampling. The Gaussian characteristic separates the lines of work included here as resampling from such methods as cross-validation, which is likely to use a Gaussian-distribution-based test to determine the goodness of fit of the model, and in any case does not break with the older tradition in this respect. The Normal distribution might enter into resampling work if the problem is to test whether a given sample fits the Gaussian shape reasonably closely. And it might be used to broaden the best-guess universe when there are very few observed data. This is one of the two characteristics that Diaconis and Efron also cite as fundamental to the methods under discussion here. They have written of "freedom from two limiting factors that have dominated statistical theory since its beginnings: the assumption that the data conform to a bell-shaped curve and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically". And they say that "Freedom from the reliance on Gaussian assumptions is a signal development in statistics" (1983, p. 116). Even more generally, resampling proceeds without the use of any theoretical distributions, which is another reason not to consider the jackknife as a resampling method. The entire body of resampling methods (see, e.g., Simon, 1969b, 1993; Noreen, 1986; and Efron and Tibshirani, 1993) proceeds without the Gaussian distribution.
It should be noted, however, that there are several reasons for departing from the Gaussian distribution. My aim was to avoid the use of any intellectual device or formula that the typical user does not understand completely, all the way down to the intuitive roots. The use of any parametric test founded on the Gaussian distribution fails on this criterion, if only because of the intuitive difficulty of the very formula for the Gaussian distribution, which few know and fewer understand. Technical advantages such as the increase in efficiency and reduction in bias that non-parametric (especially simulation) tests often (but not always) provide are, in my view, a bonus, rather than the central motivation that they were for Efron.

Computer (or Computational) Intensivity

In the view of many, involvement with computers is central to the resampling methods under discussion here. Noreen called his book Computer Intensive Methods for Testing Hypotheses (1989), and Diaconis and Efron titled their 1983 Scientific American article "Computer-Intensive Methods in Statistics". In my view, however, computer intensivity is not a fundamental demarcation between resampling and conventional methods. For small data sets, resampling tests can often be done quite satisfactorily without any calculating machinery, let alone high-powered machinery. For example, the law-school example that is the centerpiece of the "Computer-Intensive Methods..." article can be done with a pack of 15 cards. A hundred samples of 15 draws (with replacement) provide quite a satisfactory test for most purposes, and (aside from the computation of the correlation coefficient for each sample, which is not part of the bootstrap operation) can be done in an hour or two, less time than a conventional test might take even if the user did not have to look up the conventional formula. A computer is more convenient than shuffling cards, of course.
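The card procedure just described - draw 15 pairs with replacement, compute the correlation, and repeat a hundred times - can be sketched as follows. The LSAT/GPA pairs below are invented for illustration, not Efron's actual law-school data:

```python
import math
import random

def pearson_r(pairs):
    """Plain Pearson correlation coefficient for (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation(pairs, trials=100, seed=0):
    """Resample the n pairs with replacement and recompute r each
    time - the pack-of-cards procedure described in the text."""
    rng = random.Random(seed)
    n = len(pairs)
    return [pearson_r([rng.choice(pairs) for _ in range(n)])
            for _ in range(trials)]

# Invented (LSAT, GPA) pairs for 15 hypothetical law schools
schools = [(560, 2.9), (580, 3.0), (600, 3.1), (620, 3.2), (640, 3.3),
           (545, 2.8), (575, 3.0), (595, 3.0), (615, 3.3), (655, 3.4),
           (550, 2.7), (585, 3.1), (605, 3.2), (635, 3.3), (660, 3.5)]
rs = bootstrap_correlation(schools, trials=100)
```

The spread of the hundred resampled r values gives the bootstrap picture of the variability of the correlation; note that only the bookkeeping loop is the bootstrap proper, the correlation formula being the same one would use in a conventional analysis.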
But a thousand repetitions of that test can be done on the cheapest and most primitive personal computer in a couple of minutes at the most, which is not computationally intensive. And doing the test without the intercession of the computer often helps make the process intuitively clear to the person who performs the test. An example of a practical problem in hypothesis-testing, performed without the computer by a research assistant in an hour or so (Lyon and Simon, 1968), concerned whether average state income is related to the price elasticity of demand for cigarettes. The arc elasticity was estimated for 73 state tax changes, and then the medians were calculated for the 36 tax changes among the high-income states and the 36 tax changes among the low-income states. A Monte Carlo randomization test was then conducted by shuffling cards, and twenty trials were sufficient to show that the difference in observed medians was not infrequent on the null hypothesis of no difference due to income. Almost any regression analysis is at least as computer-intensive as most resampling methods. A difference between resampling methods (as defined here) and the jackknife and cross-validation is that though heavy use of the computer may not be necessary in many problems to arrive at acceptable resampling estimates, more intensive computing will produce more precise estimates. This is not true of the jackknife or cross-validation, which further distinguishes them from the methods referred to here as resampling. If statistical significance had been "in the cards" in the cigarette-tax case above, a much larger number of trials could have been drawn in a couple of hours. Because flipping coins and taking samples of random numbers with paper and pencil is cumbersome, and a nuisance after a while, I developed the Resampling Stats language in 1973.
It was programmed in batch mode by Dan Weidenfeld for a mainframe (Simon and Weidenfeld, 1974), then in interactive mode for the Apple about 1980 by Derek Kumar, then for the IBM-PC starting in 1983, and in 1991 for the Macintosh. Standard languages such as Basic, or even languages written for the specific purpose of simulation (except APL), do not allow the user to write a program which closely resembles the operations one does by hand in resampling simulations, as does Resampling Stats. Nor do conventional statistical packages that provide a bootstrap option, such as Minitab or RATS. The language and program are illustrated below. Though the use of computers may not be crucial, there is no doubt that easy and cheap access to personal computers has greatly advanced the use of resampling methods.

Some Non-Issues

Because of the identification of the bootstrap with the whole of resampling on the part of many persons, it is worth noting characteristics of the bootstrap that are not necessary characteristics of resampling tests generally. The bootstrap samples with replacement. But permutation tests and other resampling tests, such as some correlation and matching tests, sample without replacement (though correlation tests may also be done with replacement if it is judged appropriate). So the issue of replacement is not a defining characteristic of resampling. Efron wrote that "Originally I called the bootstrap distribution the 'combination distribution'. That is because it takes combinations of the original data rather than permutations. There are no permutations to take in a one-sample problem." (letter of April 26, 1984).
This characteristic distinguishes the bootstrap from permutation-test resampling in the line of Dwass and of Chung and Fraser, and also from the one-sample correlation problem for which I propose a measure of association (different from the correlation coefficient) which gets the job done, yet is intuitive and requires no formalism to explain (1969, examples 16-19, pp. 399-409), and is amenable to a resampling test of significance. But this characteristic is specific to the bootstrap and not to resampling at large.

Intended for Complex and Difficult Problems, versus For All Problems

Here we come to the crucial distinction between the point of view urged here and that of many other writers on resampling, including Hope (as quoted earlier), Westfall and Young (1993), and Hall (1992). The orientation away from routine problems also is seen in this quote from Mosteller: "It gives us another way to get empirical information in circumstances that almost defy mathematical analysis." (Kolata, 1988, p. C1). And though Efron's primary illustration (1983) - the law-school GPA and LSAT correlation problem - is well-handled by standard techniques, the main (though perhaps not exclusive) purpose of Efron's bootstrap seems to be to handle problems that are not easily dealt with by standard techniques, e.g., "the bootstrap can routinely answer questions which are far too complicated for traditional statistical analysis" (Efron and Tibshirani, 1986, p. 54). And "...the new methods free the statistician to attack more complicated problems, exploiting a wider array of statistical tools" (Diaconis and Efron, 1983, p. 116). In this respect Efron's focus is similar to that of the original Monte Carlo simulations of probabilistic problems sufficiently difficult to defy analytic solution, as noted earlier in connection with Ulam. Most of the articles in the technical literature describe advanced applications.
(This is explainable to a considerable degree by the fact that the technical journals do not favor simple applications or transparent and "obvious" ideas.) In contrast, the point of view urged here is that resampling provides a powerful tool that researchers and decision-makers (rather than only statisticians) can use with relatively small chance of error and with total understanding of the tool, unlike Normal-distribution-based methods, which are understood down to the root by almost no users, no matter how sophisticated. (Evidence for the statement: ask a small sample of users of statistics to write and interpret the formula for the Gaussian distribution.) Friedman expresses a similar view: "Eventually, it [he was referring to the bootstrap, but by implication the comment refers to all resampling] will take over the field, I think" (Kolata, 1988, p. C1).

One of the virtues of resampling is that it induces users to invent their own methods. This does not imply keeping people in ignorance of resampling (and other) methods that have been invented by others, and surely learning that body of experience will assist them in re-invention. What is sought is that the user not simply choose among a set of pre-written templates or formulas and then merely fill in the unknowns, because that process is likely to result in an unsound choice of method. (An additional benefit of re-invention as a method of study is that people are particularly likely to remember what they themselves actively invent.)

The true revolution connected with resampling, in my view, is the step away from any analytic device in handling a particular set of data, away from "statistical measures whose theoretical properties can be analyzed mathematically", as Diaconis and Efron put it (1983, p. 116). The sample of resampling methods in my 1969 text takes this step to its logical extreme.
The variety of methods was chosen to illustrate the power and scope of the general method, and also to stake out the ground for future discussion.

COMMENTS

1. Resampling methods are not always better than other methods, nor are they always to be preferred; they can be more subject to skewness than conventional tests, and there can be so little information in a sample that adding assumptions such as Normality may improve reliability. Nevertheless, I suggest that one should think first of resampling methods in all or most situations. Furthermore, a resampling procedure may be the method of choice even when a more efficient conventional test exists, because the wrong conventional test is more likely to be used than the wrong resampling test. That is, the likelihood of "Type 4 error" -- using the wrong test -- is lower when the user is oriented to resampling, a consideration which I consider to be of great importance. I urge that we think in terms of a validity concept -- perhaps it should be given a label such as "statistical utility" -- which takes into account the likelihood that an appropriate test will be used, as well as the efficiency of the test that is used (assuming it to be appropriate).5 The proper way to assess the statistical utility of resampling versus other methods must be empirical inquiry rather than esthetic taste cum analytic judgment. And the controlled tests that bear upon the matter (see Simon, Atkinson, and Shevokas, 1976) find better results for resampling methods, even without the use of computers. The test of statistical utility should be with respect to users, in my view, and not with respect to statisticians. The notion that there is a skilled statistician with sound scientific judgment at the elbow of every user, and therefore that the test of statistical utility should be with respect to statisticians, seems quite implausible.
If others disagree, the matter could easily be checked by examining a sample of scientific papers in various disciplines.

2. Insight into the prospects for the promotion of resampling as the tool of first recourse, in 1969 and now, can be gained from Efron's remark: "I've taken a tremendous amount of guff. Statisticians are hard to convince. They tend to be very conservative in practice." Indeed, he found that resampling methods met sheer disbelief at first. "When I presented it to people they said it wouldn't work", says Efron. And even if people accept its validity, they find reasons to reject it. "Some said it was too simple. Others said it was too complicated" (Science, p. 158). Fortunately for the field, he persevered. Another source of difficulty for resampling is the fundamental attitude of the statistics profession toward non-proof-based methods. As S. Stigler put the matter in a related connection: "Within the context of post-Newtonian scientific thought, the only acceptable grounds for the choice of an error distribution were to show that the curve could be mathematically derived from an acceptable set of first principles" (1986, p. 110). This may be related to Mosteller's comment quoted above that the bootstrap (and presumably all resampling) is "anti-intuitive".

CONCLUSIONS AND SUMMARY

There is solid agreement on the nature of the core techniques of resampling -- stochastic permutation tests and bootstrap procedures. Both constitute a best-guess universe from the observed data, and they differ only in whether or not the drawings are replaced. There is less agreement about whether such simulation techniques as goodness-of-fit procedures should be considered resampling. In my view, they have many crucial characteristics in common with the core techniques -- including the use of the best-guess-universe concept -- and they differ greatly from the conventional methods in not calculating probabilities by way of sample-space analysis.
Hence, I argue, they should be considered part of the same extended set as the core techniques. Resampling appropriately includes hypothesis-testing and confidence intervals, as well as other devices such as goodness-of-fit. The literature has mostly addressed the use of resampling methods when conventional methods are not available, either because assumptions are not easily met or because the problems are too complex for conventional methods. In contrast, I urge that they should be the first alternative considered for all problems in probabilistic statistics (and in probability as well), though there are some problems for which resampling methods are inferior to conventional methods. They are practical tools for users of statistics who are not professional statisticians, and who all too often fall into confusion and frustration in using conventional methods which their intuition cannot follow down to the foundations.

FOOTNOTES

[1]: An example of the continuing belief among many statisticians that resampling methods should be used when closed-form methods are not feasible, rather than being the tool of first resort, may be found in a review by Leger et al.: "The bootstrap should not be viewed as a replacement for mathematics, for only with a sound theoretical foundation can resampling methods be applied safely in practice." (1992, p. 396)

ENDNOTES

1. I am grateful to Peter Bruce for his excellent suggestions and criticism of two previous drafts of this article.

2. John Pratt pointed out their work when I submitted an article to JASA, which he was then editing.

3. One could draw only a sub-sample of jackknife observations, with or without replacement, and consider the result a resampling test, akin to the relationship between the Dwass sampling procedure and the Fisher randomization test.
But though sampling is essential for the feasibility of the Fisher test as the sample grows even moderately large, even in these days of cheap computation, this is not so for the jackknife, because of the much smaller number of possibilities in the complete set. For other reasons to come, too, the jackknife is not in the spirit of the other tests labeled here as resampling. But the inclusion or exclusion of the jackknife is not critical to the discussion, and hence it would be best not to get caught up in this matter.

4. Efron also uses the term "bootstrap" in fashions other than the above definition from time to time. For example, he writes of "bootstrapping the entire process of data analysis" (1983 article with Diaconis), which suggests that he identifies the term with all resampling methods, including permutation tests, etc. And in some places he refers to it as a "method for assigning measures of accuracy to statistical estimates" (Efron-Tibshirani, p. 10), while elsewhere he includes hypothesis tests; so either there is no difference in his mind between those two topics or his definition shifts from time to time.

5. This point was stressed by Simon, Atkinson, and Shevokas: "It must be emphasized that the Monte Carlo method as described here really is intended as an alternative to conventional analytic methods in actual problem-solving practice. This method is not a pedagogical device for improving the teaching of conventional methods. This is quite different than the past use of the Monte Carlo method to help teach sampling theory, the binomial theorem and the central limit theorem. The point that is usually hardest to convey to teachers of statistics is that the method suggested here really is a complete break with conventional thinking, rather than a supplement to it or an aid in teaching it. That is, the simple Monte Carlo method described here is complete in itself for handling most -- perhaps all -- problems in probability and statistics" (1976, p.
734, second italics added here). This does not include such matters as the design of experiments and decision analysis. It also would be better to use the term "problems in compound probability calculation". And at that time we were not aware of some of the limitations of the bootstrap, and presumably of other resampling tests, that have been uncovered since then.
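As a concrete illustration of a "problem in compound probability calculation" handled by the simple Monte Carlo method described in the endnote above, here is a sketch of the classic birthday problem: estimate the probability that at least two of 25 people share a birthday. The choice of example and all details of the code are mine, not drawn from the sources quoted; the simulation simply constitutes the known universe (365 equally likely days) and counts how often the event occurs in repeated samples.

```python
import random

random.seed(2)

def has_shared_birthday(group_size):
    # Draw one birthday per person, uniformly from 365 days
    # (a simplifying assumption that ignores leap years).
    birthdays = [random.randrange(365) for _ in range(group_size)]
    # A repeat occurred if the set of distinct birthdays is smaller
    # than the group.
    return len(set(birthdays)) < group_size

trials = 10000
hits = sum(has_shared_birthday(25) for _ in range(trials))
print("estimated P(shared birthday among 25):", hits / trials)
# The analytic value is about 0.57; the estimate should land near it.
```

No formula for compound probabilities is invoked anywhere; the same drawing-and-counting loop would serve for any variation of the problem, which is what "complete in itself" means in the passage quoted.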