CHAPTER III-2

BACKGROUND AND ANALYSIS OF THE RESAMPLING METHOD

INTRODUCTION

The term "resampling" has been applied to a variety of techniques for statistical inference, among which stochastic permutation and the bootstrap are the most characteristic. Resampling methods are evolving rapidly, and their scopes and interrelationships are not always clear. The aim of this chapter, therefore, is to distinguish the various techniques falling under the resampling rubric from other related techniques, in order to aid discussion of the set of methods.

There are two domains corresponding to the term "resampling". The wider domain includes all uses of simulation techniques for statistical inference (though not uses of simulation for the development of other techniques). The narrower sub-domain includes only simulation techniques that re-use the observed data to constitute a universe from which to draw repeated samples (without replacement = permutation techniques; with replacement = bootstrap techniques). The narrower, quintessential domain may also be defined as follows: those techniques that a) use the sample data in their entirety (though not necessarily with replacement) to repeatedly produce hypothetical samples, either by drawing subsamples stochastically or by rearranging the original observations stochastically, and then b) compare the results of those simulation samples to the observed sample. It will not always be easy to keep these two domains - or domain and sub-domain - clearly distinguished. Additionally, I argue that it is often useful to extend the term "resampling" beyond statistical inference and into probability, where one generates the simulation samples from a known device rather than from an unknown universe estimated by the observed data, because the mathematical simulation processes are identical in probability and statistics.
Because resampling is still in its early stages, there is little consensus about its definitions as well as its practices, which means that the discussion will inevitably have many loose ends and be open to many rebuttals. But I hope that the reader will view the vigor and yeastiness of the controversy as indicating that this is the beginning of a discussion of important issues, rather than concluding that the discussion that follows is unsatisfactory because it is subject to so many criticisms. (Indeed, absence of loose ends typically indicates that a topic is so settled that no further discussion is needed.) The appropriate question, as I see it, is not whether there are flaws in the discussion to follow, but rather whether the issues deserve to be aired in public where they can be thrashed out.

The next section discusses the intellectual paths that have led to the general resampling method. The following section provides a classification of resampling methods and discusses their characteristics. After that comes a section of comment.

ROADS TO RESAMPLING

Several quite different intellectual roads have led to the body of methods called "resampling" as of 1996. The fact that such different roads lead to the same place may be considered empirical evidence for the inevitability of the general approach to inferential statistics, even before high-speed computers became commonplace and cheap.

Dwass, and Chung-Fraser: Approximation of Classical Method

The first publication of any of the techniques that now make up the resampling kit bag was by Meyer Dwass in 1957, and by J. H. Chung and D. A. S. Fraser in 1958. Both papers pointed to the value of Fisher's permutation test (1935; see also Pitman, 1937, pp. 322-335, for advances in the direction begun by Fisher), misleadingly called the "randomization" technique.
Both noted that with a large sample the "exact" Fisher test is not feasible because of the computational difficulty (before the age of powerful computers). They then suggested that a randomly generated subset of the possible permutations could provide the benefits of the permutation test without excessive computational cost. The underlying idea was to use the power of sampling, in a fashion similar to the way it is used in empirical samples from large universes of data, in order to approximate the ideal test based on the complete set of permutations. And they showed that the approximation would be quite satisfactory. So their vision was to gain the benefits of the classical array of methods - though not a parametric test in this case - by the technical device of simulation sampling. Unfortunately, this technical trick does not work in the case of parametric tests themselves, because there is no way to use the device of sampling to replace the formulaic basis and the tabular superstructure of such methods as the Normal-based z test or the t-test (though the stochastic permutation test may be seen as a substitute for the t-test and hence a way of evading its use). Therefore, the path opened by Dwass and Chung-Fraser does not immediately broaden out intellectually into the resampling highway, though Draper and Stoneman built on the earlier work with an application of the permutation test to regression (1966). The first method used for comparison of liquor prices in Chapter III-1 is a permutation test. It should be noted that the stochastic permutation test is one of the two central resampling techniques; it is true "resampling" in the sense of treating the observed data as the best guess about the nature of the universe of interest, and then re-using those data as the basis for experimental samples.
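The Dwass and Chung-Fraser proposal - draw a random subset of the possible rearrangements rather than enumerating them all - can be sketched in a few lines of modern code. The following is a minimal illustration only, with invented numbers; the original papers of course predate any such language:

```python
import random

def permutation_test(group_a, group_b, trials=10_000, seed=0):
    """Stochastic permutation test for a difference in means.

    Instead of enumerating every division of the pooled data
    (Fisher's exact approach), shuffle the pooled data - i.e.
    rearrange without replacement - a random subset of times.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)                      # rearrange without replacement
        diff = (sum(pooled[:n_a]) / n_a
                - sum(pooled[n_a:]) / len(group_b))
        if abs(diff) >= abs(observed):           # two-sided comparison
            extreme += 1
    return extreme / trials                      # estimated p-value

# Hypothetical two-group prices, for illustration only
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90]
state = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00]
p = permutation_test(private, state)
```

The estimated p-value converges on the "exact" permutation p-value as the number of trials grows, which is precisely the sense in which these authors viewed the stochastic test as an approximation of the ideal test.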
It should also be noted that stochastic permutation and the bootstrap are identical except for whether or not the samples are taken with replacement, and the two methods converge toward the same result as observed sample size increases; in many applications it is difficult or impossible to establish a clear philosophic justification for the use of one or the other technique. There is one conceptual difference, however: the stochastic permutation test may plausibly be seen as a sampling "approximation" to an "exact" technique; no such notion is possible for the bootstrap, so the latter is even more distant from the conventional approach than is the former. The inherent sensibleness of a stochastic permutation test is evidenced by its independent simultaneous discovery and also by its later re-discovery by me (1969) and by Feinstein (1973). Additional evidence of this is the independent re-discovery of the same stochastic permutation principle, in the somewhat different context of tests of significance with survival data, in 1970 by Forsythe and Frey. The foregoing writers viewed a stochastic simulation as a less exact approximation of the ideal test. There was no mention that the essential nature of simulation differs from formulaic tests in not requiring counting the points in the sample space, the central element in probability theory.

Barnard's Test for the Fit to a Distribution

In a few brief paragraphs in a comment in 1963 that is difficult to find even with the citation in hand, Barnard (1963) suggested a simulation test for how well a given sample fits a theoretical distribution - specifically, a runs test based on comparison to the results of drawings from a (horizontal) distribution - and envisioned doing the work with the aid of a computer. As with Dwass and Chung-Fraser, Barnard was offering a simulation technique as an inexact substitute for a formulaic method when the formulaic method is infeasible.
This was stated clearly in a comment by Hope (1968) in the context of further work on Barnard's test. Hope recommended that Monte Carlo tests not be the tool of first resort: "It is preferable to use a known test of good efficiency instead of a Monte Carlo test procedure ..." (Hope, 1968, p. 582). Hope did go further, however, and recommended that a resampling test be used when "the necessary conditions for applying the [conventional] test may not be satisfied, or the underlying distribution may be unknown or it may be difficult to decide on an appropriate test criterion. Also, it is possible that only a physical model can be obtained which cannot be expressed in mathematical terms" (p. 582). But resampling still is seen as a second-best method. [1] Barnard's test is in the penumbra of resampling because it uses an independent device - a coin, or random numbers - to generate the trial simulation samples. It may be viewed either as a third category intermediate between formulaic methods and core resampling, or as a member of the larger resampling domain but not of the core sub-domain.

An Overall Approach to Inference, and the Bootstrap

In 1967 (Simon, 1969a, 1969b, chapters 23-25; 3rd edition with Paul Burstein, 1985) I developed the resampling method in a very general context, starting with first principles of simulation and statistical inference rather than with any particular formulaic device. I illustrated the general idea and showed its breadth and power with a variety of methods (including the bootstrap and the stochastic permutation test and many others) for a range of problems including hypothesis tests, confidence intervals, fits to distributions, fixing of sample size, and other statistical needs. The intellectual basis was the centuries-old practice of experimentation to learn the odds in gambling games, together with the idea of Monte Carlo simulations of complex physical phenomena at Rand during and after World War II (see Ulam, 1976, pp.
196-199; Metropolis and Ulam, 1949; for their group, Monte Carlo was "a statistical approach to the study of differential equations", a device for dealing with the "completely intractable task...in closed form" [pp. 335, 337, 338], that is, a taking from statistics, whereas for me the method was an approach to statistical practice, a giving to statistics). I referred to the work as "Monte Carlo" and I wrote that it departed from the earlier work of Dwass and Chung-Fraser (seen as examples of the practice of resampling at large) and at Rand in two main ways:

1) I dealt with simple problems where the value of the technique was that persons could arrive at sound solutions that are perfectly understandable, rather than using mysterious formulas that are often wrongly chosen; this included problems in probability as well as statistical inference.

2) Despite illustrating the use of the same general method on probabilistic problems, I focused on the problems of statistical inference rather than those considered in studies of probability, and mapped out the entire range of applied problems in inferential statistics. This emphasis on a new general technique to be used across the board, rather than a single particular device to be used in a particular situation, was the most radical innovation and the aspect of the work that evoked (and still evokes) the most resistance, because it calls into question the existing body of formulaic methods.

When first developing this material I was not aware of the work of Dwass and of Chung and Fraser and hence, like Feinstein after me (1973), I re-invented their idea.[2] I subsequently attributed to them the entire vision of a Monte Carlo approach to statistical inference, though in retrospect it can be seen that their view of the matter was much less general.
When in 1976 I (with Atkinson and Shevokas) published results of controlled experiments showing that persons arrive at more correct answers to basic statistical problems when they are taught and employ resampling methods rather than conventional formulaic methods, I wrote: "It must be emphasized that the Monte Carlo method as described here really is intended as an alternative to conventional analytic methods in actual problem-solving practice...the simple Monte Carlo method described here is complete in itself for handling most - perhaps all - problems in probability and statistics" (1976, p. 734). I believe that that statement, with the vision that it expresses, is the radical departure from previous thought on the practice of statistics. The development of particular techniques is subsidiary. It seems to me that this general statement is the most important element in the resampling approach.

The Bootstrap: Re-sampling with Replacement

The comparison of liquor prices in private-enterprise and state-owned systems discussed in Chapter III-1 can also be done with a process of sampling with replacement rather than permutation; such a test was dubbed the "bootstrap" by Bradley Efron in 1979. It was first published in three examples in Simon (1969b, with further discussion in correspondence with Kruskal, 1969), and was in common use at the University of Illinois in the early 1970s, being the only method used for hypothesis-testing in the 1976 text by Atkinson, Shevokas, and Travers. Recently I have concluded that a bootstrap-type test has better theoretical justification than a permutation test in this case, though the two reach almost identical results with a sample this large. The following discussion of which is most appropriate brings out the underlying natures of the two approaches, and illustrates how resampling raises issues which tend to be buried amidst the technical complexity of the formulaic methods, and hence are seldom discussed in print.
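The bootstrap counterpart of such a two-group comparison can be sketched in the same spirit; the only mechanical change from a stochastic permutation test is that the pooled observations are drawn with replacement. A minimal Python sketch, with invented prices standing in for the actual Chapter III-1 data:

```python
import random

def bootstrap_test(group_a, group_b, trials=10_000, seed=0):
    """Bootstrap test of a difference in means: identical in form
    to the stochastic permutation test, except that the pooled
    data are resampled WITH replacement."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    extreme = 0
    for _ in range(trials):
        # draw n_a + n_b values with replacement from the pooled universe
        sample = [rng.choice(pooled) for _ in range(n_a + n_b)]
        diff = sum(sample[:n_a]) / n_a - sum(sample[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / trials

# Hypothetical stand-in prices, for illustration only
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90]
state = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00]
p_boot = bootstrap_test(private, state)
```

With samples of any substantial size the p-value from this procedure and from the corresponding permutation test nearly coincide, which is the convergence remarked on above.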
Imagine a class of 42 students, 16 men and 26 women, who come into the room and sit in 42 fixed seats. We measure the distance of each seat to the lecturer, and assign each a rank. The women sit in ranks 1-5, 7-20, etc., and the men in ranks 6, 22, 25-26, etc. You ask: Is there a relationship between sex and ranked distance from the front? Here the permutation procedure that resamples without replacement - as used above with the state liquor prices - quite clearly is appropriate. Now, how about if we work with actual distances from the front? If there are only 42 seats and they are fixed, the permutation test and sampling without replacement again is appropriate. But how about if seats are movable? Consider the possible situation in which one student can choose position without reference to others. That is, if the seats are movable, it is not only imaginable that A would be sitting where B now is, with B in A's present seat - as was the case with the fixed chairs - but A could now change distance from the lecturer while all the others remain as they are. Sampling with replacement now is appropriate. (To use a technical term, the cardinal data provide more actual degrees of freedom - more information - than do the ranks.) Note that (as with the liquor prices) the seat distances do not comprise an infinite population. Rather, we are inquiring whether a) the universe should be considered limited to a given number of elements, or b) could be considered expandable without change in the probabilities; the latter is a useful definition of "sampling with replacement". As of 1996, the U.S. state liquor systems seem to me to resemble a non-fixed universe (like non-fixed chairs) even though the actual number of states is presently fixed. The question the research asked was whether the liquor system affects the price of liquor.
We can imagine another state being admitted to the union, or one of the existing states changing its system, and pondering how the choice of system will affect the price. And there is no reason to believe that (at least in the short run) the newly-made choice of system would affect the other states' pricing; hence it makes sense to sample with replacement (and use the bootstrap) even though the number of states clearly is not infinite or greatly expandable. In short, the presence of interaction - a change in one entity causing another entity also to change - implies a finite universe composed of those elements, and use of a permutation test. Conversely, when one entity can change independently, an infinite universe and sampling with replacement with a bootstrap test is indicated.

Efron's Route to the Bootstrap and Development of It

Efron connects his work - at first, his apparent rediscovery of the bootstrap and later his wider applications of resampling - to the jackknife. "Historically the subject begins with the Quenouille-Tukey jackknife, which is where we will begin also" (1979; 1982, p. 1). This connection to the jackknife is immediately obvious in the title of his first work on the subject, and in his discussions of the subject since then. Diaconis and Efron later wrote: "There are close theoretical connections among the methods [cross-validation, jackknife, bootstrap]. One line of thinking develops them all, as well as several others, from the bootstrap" (1983, p. 130). But this statement refers to the logical connections, which are the reverse of Efron's historical process. However, the jackknife (Quenouille, 1956, and Tukey, 1958) and cross-validation (or "sample splitting"; see Mosier, 1951, pp. 5-11) are entirely outside the definitions of resampling, whether narrow or broad.
They are connected with each other and with the bootstrap by a very different line of thinking than the concept of resampling; rather, they share the common aim of inferring reliability. That is, the motivations of Quenouille and Tukey in inventing the jackknife, and of Efron in developing the bootstrap, may have been similar. The natures of the devices are very different, to wit: Cross-validation separates the available data into two or more segments, and then tests the model generated in one segment against the data in the other segment(s); clearly there is no re-use of the same data, nor is there any use of repeated simulation trials. The jackknife - which, as Low (entry in the Encyclopedia of Statistical Science, Kotz & Johnson, v. 8, 1983) notes, "reduce[s] the size of the sample [really, the resampling universe] in each of the re-computations of the statistic" - does not use the data in their entirety for each trial[3], a key characteristic of resampling. That is, the observations omitted from the experimental samples are designated systematically, whereas in resampling (see definition below) observations are probabilistically omitted from the experimental samples. To put it differently, every jackknife analysis of a given set of data produces the same result, unlike resampling processes (assuming no problem with the seed in the random-number generator). The jackknife has in common with the resampling techniques discussed here the partial re-use of data, but it does not a) resample from the data, or b) use all of the data for any given sample. The jackknife has more in common with such scientific practices as examining the results when leaving off the extreme observations in a sample - as is suggested visually by some of Tukey's graphic techniques - than it does with other methods included in the present definition of resampling.
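The deterministic, leave-one-out character of the jackknife described above can be made concrete in a few lines; the numbers are made up for illustration:

```python
def jackknife_means(data):
    """Jackknife recomputation of the mean: each 'sample' omits
    exactly one observation, chosen SYSTEMATICALLY rather than
    probabilistically, so no random-number generator is involved."""
    n = len(data)
    total = sum(data)
    # leave-one-out means: the i-th value uses all data except data[i]
    return [(total - x) / (n - 1) for x in data]

data = [3.0, 5.0, 7.0, 9.0]
run1 = jackknife_means(data)
run2 = jackknife_means(data)
# run1 and run2 are necessarily identical: there is no randomness,
# and no trial ever uses all n observations at once
```

This is the point of the contrast: a bootstrap or stochastic permutation run varies from seed to seed and uses a full-size sample each trial, while every jackknife analysis of the same data yields the same set of leave-one-out recomputations.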
(The jackknife also makes use of the t distribution, which also puts it outside of the basic definitions of resampling as discussed below.) Indeed, though Efron (quoted above) came to the bootstrap and then resampling more generally by way of the jackknife, he notes: "In fact it would be more logical to begin with the bootstrap..." (1982, p. 1). (And indeed, discussion of the jackknife has diminished severely over time in connection with resampling.) But it is even more logical to begin with the general vision of resampling as embodied in the definition given here and in the wide range of techniques shown in my 1969 book. As I read the literature, it has been moving more and more toward that vision. And there is some movement in introductory texts to present resampling techniques as tools of first resort rather than tools to which one should turn only when stymied in the search for formulaic methods. There seems to have been no connection between Efron's development and the concept of Monte Carlo simulation, and simulation in general; the index of Efron and Tibshirani (1993) lists them only with reference to specific practices in a few particular bootstrap applications, with no reference to Stan Ulam (the putative father of the Monte Carlo method and label at Rand). Nor is there connection between Efron's work and that of Dwass and Chung-Fraser; to my knowledge, they are never referred to in his writings (though I have not examined them all). At the end of this survey of the origins of resampling, it is interesting to note that the long tradition of experimental studies of distributions and properties of estimators in statistics and econometrics, with Student being an early distinguished example, did not enter into the thinking of any of the intellectual streams discussed above. Nor did the use of simulation for pedagogical illustration, such as the sampling distribution of the sample mean.
THE CHARACTERISTICS AND THE CLASSIFICATION OF METHODS

The previous section briefly described resampling, giving both a core definition and a wider definition. This section goes into more detail about the characteristics of resampling techniques in contrast to techniques that are outside the resampling domain(s).

Re-use of the Available Data to Generate Repeated Samples

Systematic re-use of the available data is the central characteristic of resampling, and it is at the heart of the following core definition of resampling: Use the observed data in their entirety (though not necessarily with replacement) to repeatedly produce experimental samples, either by drawing subsamples stochastically or by rearranging the original observations stochastically, and then compare the results of those simulation-trial samples to the original sample data. Consider this Efron-Tibshirani definition of the bootstrap: "A bootstrap sample x* = (x1*, ..., xn*) is obtained by randomly sampling n times, with replacement, from the original data points x1, ..., xn". Another similar and very clear statement elsewhere: "Each bootstrap sample has n elements, generated by sampling with replacement n times from the original data set" (p. 13, below their Figure 2.1). If we now amend that definition by writing "with or without replacement", the permutation test is included and the definition is a formal and precise description of core resampling methods.[4] The wider definition of resampling also includes other uses of simulation techniques for statistical inference that generate samples by random drawings from distributions derived other than from the observed data - for example, Barnard's drawings from a horizontal distribution against which to compare an observed sample.
This wider definition would seem to encompass devices to serve all or most of the purposes of statistical inference, and such a wide definition was the basis of the suggestion made explicitly in Simon, Atkinson, and Shevokas (1976) that resampling be thought of as the first option in all situations. If I were to choose a label for the wider domain, I would call it the "best-guess-universe" method. This not only has the virtue of including inferential simulation methods other than those that re-use the data, but the label also points up that when one has a better guess about the universe than just the observed data - when the data are very few, for example, and other information or assumptions should be used in a Bayesian spirit - one should then not be limited to the use of the observed data. One can also broaden the definition of core resampling to include not only problems in probabilistic statistics but also problems in probability, by including the phrase "or the data-generating mechanism (such as a die)" after "observed data" in the definition above. Problems in pure probability may at first seem different in nature from the probabilistic-statistical (inverse probability) problems, and foreign to the concerns of statisticians. But the same logic as stated in the definition above applies to problems in probability as to problems in inferential statistics. The only difference is that in probability problems the model is known in advance - say, the model implicit in a deck of cards plus a game's rules for dealing and counting the results - rather than being inferred from, and best described by, the observed data, as in resampling statistics. Efron has given a definition in the same spirit: "You use the data to estimate probabilities and then you pick yourself up by your bootstraps and see how variable the data are in that framework" (Science, 13 July, 1984, p. 157).
Though Efron focuses upon the variability of a sample statistic in this definition, the centrality of re-use is apparent. (And though he was referring only to the bootstrap technique, this definition obviously applies to permutation tests as well.) It has been noted earlier that the jackknife and cross-validation do not fit the definition of resampling. Nor do other standard closed-form methods in inference.

Non-Use of the Gaussian Distribution

The non-use of the Normal distribution is another of the central characteristics of resampling. The Gaussian characteristic separates the lines of work included here as resampling from such methods as cross-validation, which is likely to use a Gaussian-distribution-based test to determine the goodness of fit of the model, and in any case does not break with the older tradition in this respect. The Normal distribution might enter into resampling work if the problem is to test whether a given sample fits the Gaussian shape reasonably closely. And it might be used to broaden the best-guess universe when there are very few observed data. This is one of the two characteristics that Diaconis and Efron also cite as fundamental to the methods under discussion here. They have written of "freedom from two limiting factors that have dominated statistical theory since its beginnings: the assumption that the data conform to a bell-shaped curve and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically". And they say that "Freedom from the reliance on Gaussian assumptions is a signal development in statistics" (1983, p. 116). Even more generally, resampling proceeds without the use of any theoretical distributions, which is another reason not to consider the jackknife as a resampling method. The entire body of resampling methods (see, e.g., Simon, 1969b, 1993; Noreen, 1986; and Efron and Tibshirani, 1993) proceeds without the Gaussian distribution.
It should be noted, however, that there are several reasons for departing from the Gaussian distribution. My aim was to avoid the use of any intellectual device or formula that the typical user does not understand completely, all the way down to the intuitive roots. The use of any parametric test founded on the Gaussian distribution fails on this criterion, if only because of the intuitive difficulty of the very formula for the Gaussian distribution, which few know and fewer understand. Technical advantages such as the increase in efficiency and reduction in bias that non-parametric (especially simulation) tests often (but not always) provide are, in my view, a bonus, rather than the central motivation that they were for Efron.

Computer (or Computational) Intensivity

In the view of many, involvement with computers is central to the resampling methods under discussion here. Noreen called his book Computer Intensive Methods for Testing Hypotheses (1989), and Diaconis and Efron titled their 1983 Scientific American article "Computer-Intensive Methods in Statistics". In my view, however, computer intensivity is not a fundamental demarcation between resampling and conventional methods. For small data sets, resampling tests can often be done quite satisfactorily without any calculating machinery, let alone high-powered machinery. For example, the law-school example that is the centerpiece of the "Computer-Intensive Methods..." article can be done with a pack of 15 cards. A hundred samples of 15 draws (with replacement) provide quite a satisfactory test for most purposes, and (aside from the computation of the correlation coefficient for each sample, which is not part of the bootstrap operation) can be done in an hour or two, less time than a conventional test might take even if the user did not have to look up the conventional formula. A computer is more convenient than shuffling cards, of course.
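The card procedure just described - draw 15 pairs with replacement, compute the correlation, and repeat a hundred times - can be sketched as follows. The LSAT/GPA pairs below are invented for illustration, not Efron's actual law-school data:

```python
import math
import random

def pearson_r(pairs):
    """Plain Pearson correlation coefficient for (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation(pairs, trials=100, seed=0):
    """Resample the n pairs with replacement and recompute r each
    time - the pack-of-cards procedure described in the text."""
    rng = random.Random(seed)
    n = len(pairs)
    return [pearson_r([rng.choice(pairs) for _ in range(n)])
            for _ in range(trials)]

# Invented (LSAT, GPA) pairs for 15 hypothetical law schools
schools = [(560, 2.9), (580, 3.0), (600, 3.1), (620, 3.2), (640, 3.3),
           (545, 2.8), (575, 3.0), (595, 3.0), (615, 3.3), (655, 3.4),
           (550, 2.7), (585, 3.1), (605, 3.2), (635, 3.3), (660, 3.5)]
rs = bootstrap_correlation(schools, trials=100)
```

The spread of the hundred resampled r values gives the bootstrap picture of the variability of the correlation; note that only the bookkeeping loop is the bootstrap proper, the correlation formula being the same one would use in a conventional analysis.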
But a thousand repetitions of that test can be done on the cheapest and most primitive personal computer in a couple of minutes at the most, which is not computationally intensive. And doing the test without the intercession of the computer often helps make the process intuitively clear to the person who performs the test. An example of a practical problem in hypothesis-testing, performed without the computer by a research assistant in an hour or so (Lyon and Simon, 1968), concerned whether average state income is related to the price elasticity of demand for cigarettes. The arc elasticity was estimated for 73 state tax changes, and then the medians were calculated for the 36 tax changes among the high-income states and the 36 tax changes among the low-income states. A Monte Carlo randomization test was then conducted by shuffling cards, and twenty trials were sufficient to show that the difference in observed medians was not infrequent on the null hypothesis of no difference due to income. Almost any regression analysis is at least as computer-intensive as most resampling methods. A difference between resampling methods (as defined here) and the jackknife and cross-validation is that though heavy use of the computer may not be necessary in many problems to arrive at acceptable resampling estimates, more intensive computing will produce more precise estimates. This is not true of the jackknife or cross-validation, which further distinguishes them from the methods referred to here as resampling. If statistical significance had been "in the cards" in the cigarette-tax case above, a much larger number of trials could have been drawn in a couple of hours. Because flipping coins and taking samples of random numbers with paper and pencil is cumbersome, and a nuisance after a while, I developed the Resampling Stats language in 1973.
It was programmed in batch mode by Dan Weidenfeld for a mainframe (Simon and Weidenfeld, 1974), then in interactive mode for the Apple about 1980 by Derek Kumar, then for the IBM-PC starting in 1983, and in 1991 for the Macintosh. Standard languages such as Basic, or even languages written for the specific purpose of simulation (except APL), do not allow the user to write a program which closely resembles the operations one does by hand in resampling simulations, as does Resampling Stats. Nor do conventional statistical packages that provide a bootstrap option, such as Minitab or RATS. The language and program are illustrated below. Though the use of computers may not be crucial, there is no doubt that easy and cheap access to personal computers has greatly advanced the use of resampling methods.

Some Non-Issues

Because of the identification of the bootstrap with the whole of resampling on the part of many persons, it is worth noting characteristics of the bootstrap that are not necessary characteristics of resampling tests generally. The bootstrap samples with replacement. But permutation tests and other resampling tests, such as some correlation and matching tests, sample without replacement (though correlation tests may also be done with replacement if it is judged appropriate). So the issue of replacement is not a defining characteristic of resampling. Efron wrote that "Originally I called the bootstrap distribution the 'combination distribution'. That is because it takes combinations of the original data rather than permutations. There are no permutations to take in a one-sample problem." (letter of April 26, 1984).
This characteristic distinguishes the bootstrap from permutation-test resampling in the line of Dwass and of Chung and Fraser, and also from the one-sample correlation problem for which I propose a measure of association (different from the correlation coefficient) which gets the job done, yet is intuitive and requires no formalism to explain (1969, examples 16-19, pp. 399-409), and is amenable to a resampling test of significance. But this characteristic is specific to the bootstrap and not to resampling at large.

Intended for Complex and Difficult Problems, versus For All Problems

Here we come to the crucial distinction between the point of view urged here and that of many other writers on resampling, including Hope (as quoted earlier), Westfall and Young (1993), and Hall (1992). The orientation away from routine problems also is seen in this quote from Mosteller: "It gives us another way to get empirical information in circumstances that almost defy mathematical analysis." (Kolata, 1988, p. C1). And though Efron's primary illustration (1983) - the law-school GPA and LSAT correlation problem - is well-handled by standard techniques, the main (though perhaps not exclusive) purpose of Efron's bootstrap seems to be to handle problems that are not easily dealt with by standard techniques, e.g., "the bootstrap can routinely answer questions which are far too complicated for traditional statistical analysis" (Efron and Tibshirani, 1986, p. 54). And "...the new methods free the statistician to attack more complicated problems, exploiting a wider array of statistical tools" (Diaconis and Efron, 1983, p. 116). In this respect Efron's focus is similar to that of the original Monte Carlo simulations of probabilistic problems sufficiently difficult to defy analytic solution, as noted earlier in connection with Ulam. Most of the articles in the technical literature describe advanced applications.
(This is explainable to a considerable degree by the fact that the technical journals do not favor simple applications or transparent and "obvious" ideas.) In contrast, the point of view urged here is that resampling provides a powerful tool that researchers and decision-makers (rather than only statisticians) can use with relatively small chance of error and with total understanding of the tool, unlike Normal-distribution-based methods, which are understood down to the root by almost no users, no matter how sophisticated. (Evidence for the statement: ask a small sample of users of statistics to write and interpret the formula for the Gaussian distribution.) Friedman expresses a similar view: "Eventually, it [he was referring to the bootstrap, but by implication the comment refers to all resampling] will take over the field, I think" (Kolata, 1988, p. C1).

One of the virtues of resampling is that it induces users to invent their own methods. This does not imply keeping people in ignorance of resampling (and other) methods that have been invented by others, and surely learning that body of experience will assist them in re-invention. What is sought is that the user not simply choose among a set of pre-written templates or formulas and then merely fill in the unknowns, because that process is likely to result in an unsound choice of method. (An additional benefit of re-invention as a method of study is that people are particularly likely to remember what they themselves actively invent.)

The true revolution connected with resampling, in my view, is the step away from any analytic device in handling a particular set of data, away from "statistical measures whose theoretical properties can be analyzed mathematically", as Diaconis and Efron put it (1983, p. 116). The sample of resampling methods in my 1969 text takes this step to its logical extreme.
The variety of methods was chosen to illustrate the power and scope of the general method, and also to stake out the ground for future discussion.

COMMENTS

1. Resampling methods are not always better than other methods, nor are they always to be preferred; they can be more subject to skewness than conventional tests, and there can be so little information in a sample that adding assumptions such as Normality may improve reliability. Nevertheless, I suggest that one should think first of resampling methods in all or most situations. Furthermore, a resampling procedure may be the method of choice even when a more efficient conventional test exists, because the wrong conventional test is more likely to be used than the wrong resampling test. That is, the likelihood of "Type 4 error" -- using the wrong test -- is lower when the user is oriented to resampling, a consideration which I consider to be of great importance. I urge that we think in terms of a validity concept -- perhaps it should be given a label such as "statistical utility" -- which takes into account the likelihood that an appropriate test will be used, as well as the efficiency of the test that is used (assuming it to be appropriate).5 The proper way to assess the statistical utility of resampling versus other methods must be empirical inquiry rather than esthetic taste cum analytic judgment. And the controlled tests that bear upon the matter (see Simon, Atkinson, and Shevokas, 1976) find better results for resampling methods, even without the use of computers. The test of statistical utility should be with respect to users, in my view, and not with respect to statisticians. The notion that there is a skilled statistician with sound scientific judgment at the elbow of every user, and therefore that the test of statistical utility should be with respect to statisticians, seems quite implausible.
If others disagree, the matter could easily be checked by examining a sample of scientific papers in various disciplines.

2. Insight into the prospects for the promotion of resampling as the tool of first recourse, in 1969 and now, can be gained from Efron's remark: "I've taken a tremendous amount of guff. Statisticians are hard to convince. They tend to be very conservative in practice." Indeed, he found that resampling methods met sheer disbelief at first. "When I presented it to people they said it wouldn't work", says Efron. And even if people accept its validity, they find reasons to reject it. "Some said it was too simple. Others said it was too complicated" (Science, p. 158). Fortunately for the field, he persevered. Another source of difficulty for resampling is the fundamental attitude of the statistics profession toward non-proof-based methods. As S. Stigler put the matter in a related connection: "Within the context of post-Newtonian scientific thought, the only acceptable grounds for the choice of an error distribution were to show that the curve could be mathematically derived from an acceptable set of first principles" (1986, p. 110). This may be related to Mosteller's comment quoted above that the bootstrap (and presumably all resampling) is "anti-intuitive".

CONCLUSIONS AND SUMMARY

There is solid agreement on the nature of the core techniques of resampling -- stochastic permutation tests and bootstrap procedures. Both constitute a best-guess universe from the observed data, and they differ only in whether or not the drawings are replaced. There is less agreement about whether such simulation techniques as goodness-of-fit procedures should be considered resampling. In my view, they have many crucial characteristics in common with the core techniques -- including the use of the best-guess-universe concept -- and they differ greatly from the conventional methods in not calculating probabilities by way of sample-space analysis.
Hence, I argue, they should be considered part of the same extended set as the core techniques. Resampling appropriately includes hypothesis-testing and confidence intervals, as well as other devices such as goodness-of-fit. The literature has mostly addressed the use of resampling methods when conventional methods are not available, either because assumptions are not easily met or because the problems are too complex for conventional methods. In contrast, I urge that they should be the first alternative considered for all problems in probabilistic statistics (and in probability as well), though there are some problems for which resampling methods are inferior to conventional methods. They are practical tools for users of statistics who are not professional statisticians, and who all too often fall into confusion and frustration in using conventional methods which their intuition cannot follow down to the foundations.

FOOTNOTES

[1]: An example of the continuing belief among many statisticians that resampling methods should be used when closed-form methods are not feasible, rather than being the tool of first resort, may be found in a review by Leger et al.: "The bootstrap should not be viewed as a replacement for mathematics, for only with a sound theoretical foundation can resampling methods be applied safely in practice." (1992, p. 396)

ENDNOTES

1. I am grateful to Peter Bruce for his excellent suggestions and criticism of two previous drafts of this article.

2. John Pratt pointed out their work when I submitted an article to JASA, which he was then editing.

3. One could draw only a sub-sample of jackknife observations, with or without replacement, and consider the result a resampling test, akin to the relationship between the Dwass sampling procedure and the Fisher randomization test.
But though sampling is essential for the feasibility of the Fisher test as the sample grows even moderately large, even in these days of cheap computation, this is not so for the jackknife, because of the much smaller number of possibilities in the complete set. For other reasons to come, too, the jackknife is not in the spirit of the other tests labeled here as resampling. But the inclusion or exclusion of the jackknife is not critical to the discussion, and hence it would be best not to get caught up in this matter.

4. Efron also uses the term "bootstrap" in fashions other than the above definition from time to time. For example, he writes of "bootstrapping the entire process of data analysis" (1983 article with Diaconis), which suggests that he identifies the term with all resampling methods, including permutation tests, etc. And in some places he refers to it as a "method for assigning measures of accuracy to statistical estimates" (Efron-Tibshirani, p. 10), while elsewhere he includes hypothesis tests; so either there is no difference in his mind between those two topics or his definition shifts from time to time.

5. This point was stressed by Simon, Atkinson, and Shevokas: "It must be emphasized that the Monte Carlo method as described here really is intended as an alternative to conventional analytic methods in actual problem-solving practice. This method is not a pedagogical device for improving the teaching of conventional methods. This is quite different than the past use of the Monte Carlo method to help teach sampling theory, the binomial theorem and the central limit theorem. The point that is usually hardest to convey to teachers of statistics is that the method suggested here really is a complete break with conventional thinking, rather than a supplement to it or an aid in teaching it. That is, the simple Monte Carlo method described here is complete in itself for handling most -- perhaps all -- problems in probability and statistics" (1976, p.
734, second italics added here). This does not include such matters as the design of experiments and decision analysis. It also would be better to use the term "problems in compound probability calculation". And at that time we were not aware of some of the limitations of the bootstrap, and presumably of other resampling tests, that have been uncovered since then.
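As a concrete illustration of a "problem in compound probability calculation" handled by the simple Monte Carlo method described in the endnote above, here is a sketch of the classic birthday problem: estimate the probability that at least two of 25 people share a birthday. The choice of example and all details of the code are mine, not drawn from the sources quoted; the simulation simply constitutes the known universe (365 equally likely days) and counts how often the event occurs in repeated samples.

```python
import random

random.seed(2)

def has_shared_birthday(group_size):
    # Draw one birthday per person, uniformly from 365 days
    # (a simplifying assumption that ignores leap years).
    birthdays = [random.randrange(365) for _ in range(group_size)]
    # A repeat occurred if the set of distinct birthdays is smaller
    # than the group.
    return len(set(birthdays)) < group_size

trials = 10000
hits = sum(has_shared_birthday(25) for _ in range(trials))
print("estimated P(shared birthday among 25):", hits / trials)
# The analytic value is about 0.57; the estimate should land near it.
```

No formula for compound probabilities is invoked anywhere; the same drawing-and-counting loop would serve for any variation of the problem, which is what "complete in itself" means in the passage quoted.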