CHAPTER III-1

THE RESAMPLING METHOD FOR STATISTICAL INFERENCE

INTRODUCTION

This chapter introduces the resampling method in historical and theoretical perspective, and illustrates the method.

About 1615, Italian gamblers brought to Galileo Galilei a problem in the game of three dice. The theorists of the day had figured the chances of getting totals of 9 and 10 (also 11 and 12) as equal, because there are the same number of ways (six) of making those points -- for example, a nine can be 126, 135, 144, 234, 225, or 333. But players had found that in practice 10 is made more often than 9. How come?

Galileo then invented the device of the "sample space" of possible outcomes. He colored three dice white, gray, and black, and systematically listed every possible permutation. The theorists who had gone before -- and Gottfried Leibniz later made the same mistake -- had instead lumped together into a single category the various possible ways of getting (say) a 3, 3, and 4 to make 10. That is, they listed combinations rather than permutations, and various combinations contain different numbers of permutations. Galileo's analysis confirmed the gamblers' empirical results. Ten does come up more frequently than 9, because there are 27 permutations that add to 10 whereas there are only 25 permutations that add to 9.

The gamblers' use of repeated trials to learn what they wanted to know illustrates the power of the resampling method -- which we can simply call "simulation" or "experimentation" here. And with sufficient repetitions, one can arrive at as accurate an answer as desired. Not only was the resampling method adequate; in the case of three dice it was a better method than deductive logic, because it gave the correct answer when the deduction of the day did not. Though the only logic needed was enumeration of the possibilities, that enumeration was too difficult for the learned doctors of the day. The powers of a Galileo were necessary to produce the correct logic.

Even after Galileo's achievement, the powers of Blaise Pascal and Pierre Fermat were needed to analyze correctly, with the multiplication rule, another such problem -- the chance of at least one ace in four throws of a die. (This problem, presented by the Chevalier de Méré, is considered the origin of probability theory.) For lesser mathematical minds, the analysis was too difficult. Yet ordinary players were able to discern the correct relative probabilities, even though the differences in probabilities are slight in both the Galileo and Pascal-Fermat problems. Simulation's effectiveness is its best argument.

One might rejoin that the situation is different now that Galileo, Pascal, Fermat, and their descendants have invented analytic methods to handle such problems correctly. Why not use the existing analytic methods instead of resampling? The existence of a correct algorithm does not imply that it will be used appropriately, however. And a wrongly-chosen algorithm is far worse than no algorithm at all -- as the Chevalier's pocketbook attested. In our own day, decades of experience have proven that "pluginski" -- the memorization of formulas that one cannot possibly understand intuitively -- may enable one to survive examinations, but does not provide usable scientific tools.
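To make the contrast concrete, here is a minimal sketch in Python (the code and names are ours, offered only as an illustration) that answers the three-dice question both ways: by enumerating the sample space, as Galileo did, and by the gamblers' route of repeated throws.

    import random
    from itertools import product

    # Galileo's route: enumerate all 6 x 6 x 6 = 216 permutations of three dice
    totals = [sum(dice) for dice in product(range(1, 7), repeat=3)]
    print(totals.count(9), totals.count(10))    # 25 permutations sum to 9, 27 sum to 10

    # The gamblers' route: repeated trials
    trials = 100_000
    nines = tens = 0
    for _ in range(trials):
        total = sum(random.randint(1, 6) for _ in range(3))
        if total == 9:
            nines += 1
        elif total == 10:
            tens += 1
    print(nines / trials, tens / trials)        # settles near 25/216 and 27/216

With enough repetitions the simulated frequencies approach 25/216 and 27/216 -- the same answer Galileo reached by enumeration, obtained without any enumeration at all.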
THE DEFINITION OF RESAMPLING AGAIN

Let's again define "resampling" briefly. A statistical procedure models a physical process. A resampling method simulates the model with easy-to-manipulate symbols. Either the observed data (in the case of a problem in statistical inference) or a data-generating mechanism such as a die (in the case of a problem in probability) is used to produce new hypothetical samples, whose properties can then be examined. The resampler postulates a universe and then examines how it behaves -- in the case of statistics, comparing the outcomes to a criterion that we choose. A more extended definition and discussion will be found in Chapter III-3.

The same logic applies to problems labeled "probability" and "statistical inference". The only difference is that in probability problems the "model" is known in advance -- say, the model implicit in a deck of poker cards plus a game's rules for dealing and counting the results -- rather than the model being assumed to be best estimated by the observed data, as in resampling statistics. (Every problem in statistical inference contains a problem in probability at its core.)

Now that resampling has become respectable, some statisticians grumble that it is "just" the Monte Carlo method, or "just" simulation. But the body of resampling methods for statistics was not originally set forth under those labels. It is the overall approach -- the propensity to turn first to resampling methods to handle practical problems -- that most clearly distinguishes resampling from conventional statistics, and from the earlier use of Monte Carlo methods. (In the nineteenth century, "simulation techniques were all tied to the normal distribution, and all involved generating errors to be added to a signal", according to historian Stephen Stigler.) In addition, some resampling methods are new in themselves, the result of the basic resample-it tendency of the past quarter century.

RESAMPLING AND STATISTICAL INFERENCE

Chapter II-2 mentioned how John Arbuthnot, doctor to Queen Anne of England, observed that more boys than girls were born in London; the records showed that male births exceeded female births 82 years in a row. Arbuthnot therefore set out to test whether a universe with a 50-50 probability of producing males could plausibly result in 82 successive years with preponderantly male births. He used the multiplication rule of Pascal and Fermat to calculate the probability (1/2)^82, which is extremely small -- roughly 2 in 10^25. "From whence it follows, that it is Art, not Chance, that governs" -- that is, Divine Providence. (His argument is complex and debatable, as statistical inference often is; the mathematics is the easy part, especially when resampling methods are used.)

Please notice that Arbuthnot could have considered the numbers of boys and girls observed in each year, rather than treating each year as a single observation -- an even stronger test because of the vastly greater amount of information it would use. Arbuthnot surely did not analyze the data for any or all of the individual years because the calculus of probability was still in its infancy. Luckily, the test he made was more than powerful enough for his purposes. But if instead of 82 years in a row, only (say) 81 or 61 of the 82 years had shown a preponderance of males, Arbuthnot would have lacked the tools for a test (though he knew the binomial and logarithms). Nowadays one conventionally uses the Gaussian (Normal) approximation to the binomial distribution to produce the desired probability. But that method requires acquaintance with a considerable body of statistical procedure, and it uses a formula that almost no one knows and even fewer can explain intuitively.
Instead, users simply "plug" the data into a table which, because it is an arcane mystery, invites misuse and erroneous conclusions.

The experimental resampling method of the earlier gamblers could easily have given Arbuthnot a satisfactory answer for (say) 61 of 82 years, however. He had in fact likened the situation to a set of 82 coins. He could simply have tossed such a set repeatedly, and found that almost never would as many as 81 -- or even 61 -- heads occur. He could then have rested as secure in his conclusion as with the formulaic assessment of the probability of 82 years in a row. And because of the intuitive clarity of the experimental method, one would not be likely to make a misleading error in such a procedure.

By the grace of the computer, such problems can be handled more conveniently today. The self-explanatory commands in Figure III-1-1 suffice, using the language RESAMPLING STATS and producing the results shown there.

Figure III-1-1

The intellectual advantage of the resampling method is that though it takes repeated samples from the sample space, it does not require that one know the size of the sample space or of a particular subset of it. To estimate with the binomial formula the probability of getting (say) 61 or more male-predominant years out of 82 requires that one calculate the number of permutations of 82 "male" and "female" years, and the number of those permutations that include 61 or more "male" years. In contrast, with a resampling approach one needs to know only the conditions that produce a single trial yielding a male or female outcome. This conceptual difference, which will be discussed at greater length below, is the reason that, compared to conventional methods, resampling is likely to have higher "statistical utility" -- a compound of efficiency plus the chance that the ordinary scientist or decision-maker will use a correct procedure.

VARIETIES OF RESAMPLING METHODS

A resampling test may be constructed for every case of statistical inference -- by definition. Every real-life situation can be modeled by symbols of some sort, and one may experiment with this model to obtain resampling trials. A resampling method should always be appropriate unless there are insufficient data to perform a useful resampling test, in which case a conventional test -- which makes up for the absence of observations with an assumed theoretical distribution such as the Normal or Poisson -- may produce more accurate results if the universe from which the data are selected resembles the chosen theoretical distribution. Exploration of the properties of resampling tests is an active field of research at present.

Chapter III-2 will discuss the various types of resampling methods. For the main tasks in statistical inference -- hypothesis testing and confidence intervals -- the appropriate resampling test often is immediately obvious. For example, if one wishes to inquire whether baseball hitters exhibit behavior that fits the notion of a slump, one may simply produce hits and outs with a random-number generator adjusted to the batting average of a player, and then compare the number of simulated consecutive sequences of either hits or outs with the observed numbers for the player. The procedure is also straightforward for such binomial situations as the Arbuthnot birth-sex case.
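For instance, the Arbuthnot case takes only a few lines. The following Python sketch (the names are ours; it simply parallels the RESAMPLING STATS program of Figure III-1-1) tosses a set of 82 fair "coins" over and over and counts how often 61 or more come up "male-predominant":

    import random

    trials = 10_000
    count = 0
    for _ in range(trials):
        # one trial: 82 "years", each a fair coin -- 1 means a male-predominant year
        males = sum(random.randint(0, 1) for _ in range(82))
        if males >= 61:
            count += 1
    print(count / trials)    # almost always 0: such a lopsided run essentially never occurs by chance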
Two sorts of procedures are especially well-suited to resampling:

1) A sample of the permutations underlying Fisher's "exact" test (confusingly, also called a "randomization" test). This is appropriate when the size of the universe is properly assumed to be fixed, as discussed below.

2) The bootstrap procedure. This is appropriate when the size of the universe is properly assumed not to be fixed.

Let's compare the permutation and bootstrap procedures in the context of a case which might be analyzed either way. The discussion will highlight some of the violent disagreements in the philosophy of statistics which the use of resampling methods frequently brings to the surface -- one of its great benefits.

In the 1960s I studied the price of liquor in the sixteen "monopoly" states (where the state government owns the retail liquor stores) compared to the twenty-six states in which retail liquor stores are privately owned. (Some states were omitted for technical reasons. The situation and the price pattern have changed radically since then.) These were the representative 1961 prices of a fifth of Seagram 7 Crown whiskey in the two sets of states:

16 monopoly states: $4.65, $4.55, $4.11, $4.15, $4.20, $4.55, $3.80, $4.00, $4.19, $4.75, $4.74, $4.50, $4.10, $4.00, $5.05, $4.20

26 private-ownership states: $4.82, $5.29, $4.89, $4.95, $4.55, $4.90, $5.25, $5.30, $4.29, $4.85, $4.54, $4.75, $4.85, $4.85, $4.50, $4.75, $4.79, $4.85, $4.79, $4.95, $4.95, $4.75, $5.20, $5.10, $4.80, $4.29

The economic question that underlay the investigation -- having both theoretical and policy ramifications -- is as follows: Does state ownership affect prices? The empirical question is whether the prices in the two sets of states were systematically different. In statistical terms, we wish to decide whether there was a difference between the groups of states related to their mode of liquor distribution, or whether instead the observed $.49 differential in mean prices might well have occurred by happenstance -- that is, whether the observed pattern could well have been produced by chance variability.

At first I used a resampling permutation test, as follows: Assuming that the entire universe of possible prices consists of the set of prices that were observed, because that is all the information available about the universe, I wrote each of the forty-two observed state prices on a separate card. The shuffled deck simulated a situation in which each state has an equal chance for each price. On the "null hypothesis" that the two groups' prices do not reflect different price-setting mechanisms, but rather differ only by chance, I then examined how often that simulated universe stochastically produces groups with results as different as those observed in 1961. I repeatedly dealt groups of 16 and 26 cards, without replacing the cards, to simulate hypothetical monopoly-state and private-state samples, each time calculating the difference in mean prices.

The probability that the benchmark null-hypothesis universe would produce a difference between groups as large as or larger than that observed in 1961 is estimated by how frequently the mean of the randomly-chosen group of sixteen prices from the simulated state-ownership universe is less than (or equal to) the mean of the actual sixteen state-ownership prices. If the simulated difference between the randomly-chosen groups was frequently equal to or greater than the difference observed in 1961, one would not conclude that the observed difference was due to the type of retailing system, because it could well have been due to chance variation.
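For readers who prefer a general-purpose language, here is a rough Python rendering of the same shuffling procedure (a sketch using only the standard library; the variable names are ours and the prices are those listed above):

    import random

    monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
                4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
    private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
               4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
               4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

    observed = sum(private) / 26 - sum(monopoly) / 16    # about $0.49

    deck = monopoly + private          # the 42 "cards" of the null-hypothesis universe
    trials, hits = 10_000, 0
    for _ in range(trials):
        random.shuffle(deck)           # shuffle and deal into two groups, without replacement
        sim_private, sim_monopoly = deck[:26], deck[26:]
        diff = sum(sim_private) / 26 - sum(sim_monopoly) / 16
        if diff >= observed:           # a difference as large as the 1961 difference?
            hits += 1
    print(observed, hits / trials)

Replacing the shuffle-and-deal step with draws made from the combined deck with replacement would turn this into the bootstrap variant mentioned below; in this data set the two versions give essentially the same verdict.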
The computer program in Figure III-1-2, written in the language RESAMPLING STATS, performs the operations described above (MATHEMATICA and APL could be used in much the same fashion).

Figure III-1-2

The results shown -- not even one "success" in 10,000 trials -- imply a very small probability that two groups with mean prices as different as those observed would arise by chance if drawn from the universe of 42 observed prices. So we "reject the null hypothesis" and instead find persuasive the proposition that the type of liquor distribution system influences the prices that consumers pay.

As I shall discuss later, the logical framework of this resampling version of the permutation test differs greatly from the formulaic version, which would have required heavy computation. The standard conventional alternative would be a Student's t-test, in which the user simply plugs into an unintuitive formula and table. And because of the unequal numbers of cases and unequal dispersions in the two samples, an appropriate t test is far from obvious, whereas resampling is not made more difficult by such realistic complications.

A program to handle the liquor problem with an infinite-universe bootstrap distribution simply substitutes the random sampling command GENERATE for the TAKE command in Figure III-1-2. The results of the new test are indistinguishable from those in Figure III-1-2.

INFARCTION AND CHOLESTEROL: RESAMPLING VERSUS CONVENTIONAL<1>

Let's now consider one of the simplest numerical examples of probabilistic-statistical reasoning given toward the front of a standard book on medical statistics (Kahn and Sempos, 1989). Using data from the Framingham study, the authors ask: What is an appropriate "confidence interval" on the observed ratio of "relative risk" (a measure defined below; it is closely related to the odds ratio) of the development of myocardial infarction 16 years after the study began, for men ages 35-44 with serum cholesterol above 250, relative to those with serum cholesterol below 250? The raw data are shown in Table III-1-1 (divided into "high" and "low" cholesterol by Kahn and Sempos).

Table III-1-1

Hypothesis Tests With Counted Data

Consider this classic question about the Framingham serum cholesterol data: What is the degree of surety that there is a difference in myocardial infarction rates between the high- and low-cholesterol groups?

The statistical logic begins by asking: How likely is it that the two observed groups "really" came from the same "population" with respect to infarction rates? Operationally, we address this issue by asking how likely it is that two groups as different in disease rates as the observed groups would be produced by the same "statistical universe".

Key step: we assume that the relevant "benchmark" or "null-hypothesis" population (universe) is the composite of the two observed groups. That is, if there were no "true" difference in infarction rates between the two serum-cholesterol groups, and the observed disease differences occurred just because of sampling variation, the most reasonable representation of the population from which they came is the composite of the two observed groups.

Therefore, we compose a hypothetical "benchmark" universe containing (135 + 470 =) 605 men at risk, and designate (10 + 21 =) 31 of them as infarction cases. We want to determine how likely it is that a universe like this one would produce -- just by chance -- two groups that differ as much as do the actually observed groups.
That is, how often would random sampling from this universe produce one sub-sample of 135 men containing a large enough number of infarctions, and the other sub-sample of 470 men few enough infarctions, that the difference in occurrence rates would be as high as the observed difference of .029? (10/135 = .074, and 21/470 = .045, and .074 - .045 = .029.)

So far, everything that has been said applies both to the conventional formulaic method and to the "new statistics" resampling method. But the logic is seldom explained to the reader of a piece of research -- if indeed the researcher her/himself grasps what the formula is doing. And if one just grabs for a formula with a prayer that it is the right one, one need never analyze the statistical logic of the problem at hand.

Now we tackle this problem with a method that you would think of yourself if you began with the following mind-set: How can I simulate the mechanism whose operation I wish to understand? These steps will do the job:

1. Fill an urn with 605 balls, 31 red (infarction) and the rest (605 - 31 = 574) green (no infarction).

2. Draw a sample of 135 (simulating the high serum-cholesterol group), one ball at a time, throwing each ball back after it is drawn to keep the simulated probability of an infarction the same throughout the sample; record the number of reds. Then do the same with another sample of 470 (the low serum-cholesterol group).

3. Calculate the difference in infarction rates for the two simulated groups, and compare it to the actual difference of .029; if the simulated difference is that large, record "Yes" for this trial; if not, record "No".

4. Repeat steps 2 and 3 until a total of (say) 400 or 1000 trials have been completed. Compute the frequency with which the simulated groups produce a difference as great as that actually observed. This frequency is an estimate of the probability that a difference as great as that actually observed in Framingham would occur even if serum cholesterol has no effect upon myocardial infarction.

The procedure above can be carried out with balls in a ceramic urn in a few hours. Yet it is natural to seek the added convenience of the computer to draw the samples. Therefore, we illustrate in Figure III-1-4 how a simple computer program handles this problem. We use RESAMPLING STATS, but the program can be executed in other languages as well, though usually with more complexity and less clarity.

Figure III-1-4

The results of the test using this program may be seen in the histogram in Figure III-1-4. We find -- perhaps surprisingly -- that a difference as large as that observed would occur by chance fully 10 percent of the time. (If we were not guided by the theoretical expectation that high serum cholesterol produces heart disease, we might also count the roughly 10 percent chance of a difference that large in the other direction, giving about a 20 percent chance in all.) Even a ten percent chance is sufficient to call strongly into question the conclusion that high serum cholesterol is dangerous. At a minimum, this statistical result should call for more research before taking any strong action clinically or otherwise.
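For comparison, a rough Python counterpart to the urn procedure of Figure III-1-4 might read as follows (a sketch; the variable names are ours):

    import random

    urn = [1] * 31 + [0] * 574        # 31 infarctions among the 605 men at risk
    observed = 10 / 135 - 21 / 470    # the observed difference, about .029

    trials, hits = 1_000, 0
    for _ in range(trials):
        high = [random.choice(urn) for _ in range(135)]   # draw with replacement
        low = [random.choice(urn) for _ in range(470)]
        diff = sum(high) / 135 - sum(low) / 470
        if diff >= observed:
            hits += 1
    print(hits / trials)    # roughly 0.10, in line with the result reported in Figure III-1-4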
Where should one look to determine which procedures should be used to deal with a problem such as the one set forth above? Unlike the formulaic approach, the basic source is not a manual which sets forth a menu of formulas together with sets of rules about when they are appropriate. Rather, you consult your own understanding about what it is that is happening in (say) the Framingham situation, and the question that needs to be answered, and then you construct a "model" that is as faithful to the facts as possible. The urn-sampling described above is such a model for the case at hand.

To connect what we have done with the conventional approach, we apply a z test (conceptually similar to the t test, but applicable to yes-no data; it is the Normal-distribution approximation to the binomial distribution for large samples) and we find that the results are much the same as the resampling result -- an eleven percent probability.

Someone may ask: Why do a resampling test when you can use a standard device such as a z or t test? The great advantage of resampling is that it avoids using the wrong method. The researcher is more likely to arrive at sound conclusions with resampling because s/he can understand what s/he is doing, instead of blindly grabbing a formula which may be in error.

The textbook from which the problem is drawn is an excellent one; the difficulty of its presentation is an inescapable consequence of the formulaic approach to probability and statistics. The body of complex algebra and tables that only a rare expert understands down to the foundations constitutes an impenetrable wall to understanding. Yet without such understanding, there can be only rote practice, which leads to frustration and error.

Confidence Intervals for Counted Data

So far we have discussed the interpretation of sample data for testing hypotheses. The devices used for the other main theme in statistical inference -- the estimation of confidence intervals -- are much the same as those used for testing hypotheses. Indeed, the bootstrap method discussed above was originally devised for the estimation of confidence intervals. The bootstrap method may also be used to calculate the appropriate sample size for experiments and surveys, another important topic in statistics.

Consider for now just the data for the sub-group of 135 high-cholesterol men. A second classic statistical question is as follows: How much confidence should we have that, if we were to take a much larger sample than was actually obtained, its sample mean (that is, its proportion of infarctions) would be in some close vicinity of the observed sample mean of 10/135 = .074? Let us first carry out a resampling procedure to answer the question, waiting until afterwards to discuss the logic of the inference.

1. Construct an urn containing 135 balls -- 10 red (infarction) and 125 green (no infarction) -- to simulate the universe as we guess it to be.

2. Mix, choose a ball, record its color, replace it, and repeat 135 times (to simulate a sample of 135 men).

3. Record the number of red balls among the 135 balls drawn.

4. Repeat steps 2 and 3 perhaps 1000 times, and observe how much the number of reds varies from sample to sample.

We arbitrarily denote the boundary lines that include 47.5 percent of the hypothetical samples on each side of the sample mean as the 95 percent "confidence limits" around the mean of the actual population. Figure III-1-5 shows how this can be done easily on the computer, together with the results.

Figure III-1-5
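In Python, the same urn might be sketched as follows (our names; the percentile cut-off follows the 2.5/97.5 rule just described):

    import random

    urn = [1] * 10 + [0] * 125        # the high-cholesterol sample treated as a proxy universe
    results = []
    for _ in range(1_000):
        resample = [random.choice(urn) for _ in range(135)]   # draw 135 with replacement
        results.append(sum(resample) / 135)                   # proportion with infarction

    results.sort()
    low, high = results[25], results[974]   # roughly the 2.5th and 97.5th percentiles of 1000 trials
    print(low, high)                        # in the neighborhood of .04 to .12, as in Figure III-1-5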
The variation in the histogram in Figure III-1-5 highlights the fact that a sample containing only 10 cases of infarction is very small; the number of observed cases -- or the proportion of cases -- necessarily varies greatly from sample to sample. Perhaps the most important implication of this statistical analysis, then, is that we badly need to collect additional data.

This is a classic problem in confidence intervals, found in all subject fields. For example, at the beginning of the first chapter of a best-selling book in business statistics, Wonnacott and Wonnacott use the example of a 1988 presidential poll. The language used in the cholesterol-infarction example above is exactly the same as the language used for the Bush-Dukakis poll, except for labels and numbers. Typically, too, the text gives a formula without explaining it, and says that it is "fully derived" eight chapters later (Wonnacott and Wonnacott, 1990, p. 5). With resampling, one never needs such a formula, and never needs to defer the explanation.

The philosophic logic of confidence intervals is quite deep and controversial, less obvious than for the hypothesis test. The key idea is that we can estimate for any given universe the probability P that a sample's mean will fall within any given distance D of the universe's mean; we then turn this around and assume that if we know the sample mean, the probability is P that the universe mean is within distance D of it. This inversion is more slippery than it may seem. But the logic is exactly the same for the formulaic method and for resampling. The only difference is how one estimates the probabilities -- either with a numerical resampling simulation (as here), or with a formula or other deductive mathematical device (such as counting and partitioning all the possibilities, as Galileo did when he answered the gamblers' question about three dice).

And when one uses the resampling method, the probabilistic calculations are the least demanding part of the work. One then has mental capacity available to focus on the crucial part of the job -- framing the original question soundly, choosing a model for the facts so as to properly resemble the actual situation, and drawing appropriate inferences from the simulation.

If you have understood the general logic of the procedures used up until this point, you are in command of all the conceptual knowledge necessary to construct your own tests to answer any statistical question. A lot more practice, working on a variety of problems, obviously would help. But the key elements are simple: 1) model the real situation accurately, 2) experiment with the model, and 3) compare the results of the model with the observed results.

Confidence Intervals on Relative Risk With Resampling

Now we are ready to calculate -- with full understanding -- the confidence intervals on relative risk that the Kahn-Sempos text sought. Recall that the observed sample of 135 high-cholesterol men had 10 infarctions (a proportion of .074), and the sample of 470 low-cholesterol men had 21 infarctions (a proportion of .045). We estimate the relative risk of high cholesterol as .074/.045.

Let us frame the question this way: If we were to randomly draw a sample from the universe of high-cholesterol men that is best estimated from our data (an infarction rate of .074), and a sample from the universe of low-cholesterol men (an infarction rate of .045), and do this again and again, within which bounds would the relative risk calculated from that simulation fall (say) 95 percent of the time? The operation is much the same as that for the single confidence interval estimated above, except that we do the operation for both sub-samples at once and then calculate the ratio of their results to determine the relative risk.
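Sketched in Python (our names again; the "substitute universe" logic behind the procedure is spelled out in the next paragraph), the two-sample operation looks like this:

    import random

    high = [1] * 10 + [0] * 125       # the observed high-cholesterol sample
    low = [1] * 21 + [0] * 449        # the observed low-cholesterol sample

    ratios = []
    for _ in range(1_000):
        # resample each group, with replacement, at its original size
        # (a zero infarction count in the low group is possible in principle but vanishingly rare)
        hi = [random.choice(high) for _ in range(135)]
        lo = [random.choice(low) for _ in range(470)]
        ratios.append((sum(hi) / 135) / (sum(lo) / 470))

    ratios.sort()
    print(ratios[25], ratios[974])    # roughly the 95 percent interval, about 0.7 to 3.4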
As before, we would like to know what would happen if we could take additional samples from the universes that spawned our actual samples. Lacking the resources to do so, we let those original samples "stand in" for the universes from which they came, serving as proxy "substitute universes." It is as if we replicated each sample element a million times and then took "bootstrap" samples from this "proxy universe." Paralleling the real world, we take simulated samples of the same size as our original samples. (In practice we need not replicate each sample element a million times; we achieve the same effect by sampling with replacement from our original samples -- that way, the chance that a given sample element will be drawn remains the same from draw to draw.) We count the number of infarctions in each of our resamples, calculate the relative risk measure for the pair of resamples, and keep score of the result. We repeat with additional pairs of resamples, each time calculating the relative risk measure, and then examine the overall results.

We may compare our results in Figure III-1-6 -- a confidence interval extending from 0.69 to 3.4 -- to the results given by Kahn and Sempos, which are 0.79 to 3.5, 0.80 to 3.4, and 0.79 to 3.7 from three different formulas (pp. 62-63); the agreement is close.

Figure III-1-6

OTHER RESAMPLING TECHNIQUES

We have so far seen examples of three of the most common resampling methods -- binomial, permutation, and bootstrap. These methods may be extended to handle correlation, regression, and tests where there are three or more groups. Indeed, resampling can be used for every other statistic in which one may be interested -- for example, statistics based on absolute deviations rather than squared deviations. This flexibility is a great virtue because it frees the statistics user from the limited and oft-confining battery of textbook methods.

SOME OTHER ILLUSTRATIONS

A Measured-Data Example: Test of a Drug to Prevent Low Birthweight

The Framingham infarction-cholesterol examples worked with yes-no "count" data. Let us therefore consider some illustrations of the use of resampling with measured data. Another leading textbook (Rosner, 1982, p. 257) gives the example of a test of the hypothesis that drug A prevents low birthweights. The data for the treatment and control groups are shown in Table III-1-2. The treatment group averaged .82 pounds more than the control group. Here is a resampling approach to the problem:

Table III-1-2

1. If the drug has no effect, our best guess about the "universe" of birthweights is that it is composed of (say) a million each of the observed weights, all lumped together. In other words, in the absence of any other information or compelling theory, we assume that the combination of our samples is our best estimate of the universe. Hence let us write each of the birthweights on a card and put them into a hat. Drawing them one by one and then replacing them is the operational equivalent of a very large (but equal) number of each birthweight.

2. Repeatedly draw two samples of 15 birthweights each, and check how frequently the simulated difference between the two is as large as, or larger than, the actual difference of .82 pounds.

We find in Figure III-1-7 that only about 1 percent of the pairs of hypothetical resamples produced means that differed by as much as .82. We therefore conclude that the observed difference is unlikely to have occurred by chance.

Figure III-1-7
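The same test can be sketched in Python (our names; the weights are those of Table III-1-2):

    import random

    treat = [6.9, 7.6, 7.3, 7.6, 6.8, 7.2, 8.0, 5.5, 5.8, 7.3, 8.2, 6.9, 6.8, 5.7, 8.6]
    control = [6.4, 6.7, 5.4, 8.2, 5.3, 6.6, 5.8, 5.7, 6.2, 7.1, 7.0, 6.9, 5.6, 4.2, 6.8]
    combined = treat + control        # the null-hypothesis universe of 30 birthweights

    trials, hits = 1_000, 0
    for _ in range(trials):
        a = [random.choice(combined) for _ in range(15)]   # draw with replacement
        b = [random.choice(combined) for _ in range(15)]
        if sum(a) / 15 - sum(b) / 15 >= 0.82:
            hits += 1
    print(hits / trials)    # on the order of 1 percent, as reported in Figure III-1-7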
Matched-Patients Test of Three Treatments

There have been several recent three-way tests of treatments for depression: drug versus cognitive therapy versus combined drug and cognitive therapy. Consider this procedure for a proposed test in 31 triplets of people who have been matched within triplet by sex, age, and years of education. The three treatments are to be assigned randomly within each triplet.

Assume that the outcomes on a series of tests were ranked from best (#1) to worst (#3) within each triplet, and assume that the combined drug-and-therapy regime has the best (that is, the lowest) average rank. How sure can we be that the observed result would not occur by chance? In the hypothetical Table III-1-3 the average rank for the drug-and-therapy regime is 1.74. Is it likely that the regimes do not "really" differ with respect to effectiveness, and that the drug-and-therapy regime came out with the best rank just by the luck of the draw? We test by asking, "If there is no difference, what is the probability that the treatment of interest will get an average rank this good, just by chance?"

Table III-1-3

Figure III-1-8 shows a program for a resampling procedure that repeatedly produces 31 ranks randomly selected from the numbers 1, 2, and 3, and then averages the ranks. We can then observe whether an average as low as 1.74 is unusual, and hence should not be ascribed to chance.

Figure III-1-8

In 1000 repetitions of the simulation (10,000 would take just a few moments longer), 5% yielded average ranks as low as the observed value. This is evidence that something besides chance might be at work here. (The result is at the borderline of the traditional 5% "level of significance" -- a p-value of .05 -- supposedly set arbitrarily by the great biostatistician R. A. Fisher on the grounds that a 1-in-20 happening is too coincidental to ignore.) That is, the resampling test suggests that it would be unlikely for a given treatment regime to achieve, just by chance, results as good as those actually observed.

An interesting feature of the treatment problem is that it would be hard to find a conventional test that would handle this three-way comparison in an efficient manner. Certainly it would be impossible to find a test not requiring formulae and tables that only a talented professional statistician could manage satisfactorily, and even s/he is not likely to fully understand those formulaic procedures.
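A Python sketch of the rank simulation (our names; 1.74 is the observed average rank of the drug-and-therapy group in Table III-1-3):

    import random

    trials, hits = 1_000, 0
    for _ in range(trials):
        # under the null hypothesis, each triplet gives the drug-and-therapy group
        # a rank of 1, 2 or 3 purely at random
        ranks = [random.choice((1, 2, 3)) for _ in range(31)]
        if sum(ranks) / 31 <= 1.74:
            hits += 1
    print(hits / trials)    # around 5 percent, as in the run reported above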
THE COMPUTER AND RESAMPLING

Some now refer to resampling as "computer-intensive statistics" (e.g., Noreen, 1986). And others have written that resampling had to await the easy availability of computers. It is arguable, however, whether the computer is a crucial element of the resampling method. Resampling operations (including the bootstrap and permutation procedures) can often be conducted quite satisfactorily with simpler tools. Indeed, the permutation test for the liquor-prices example above was done by hand with cards, and a bootstrap test could similarly have been done. Nevertheless, the inconvenience of doing tests by hand was a barrier to implementation and adoption. Therefore, in the early 1970s Dan Weidenfeld and I developed a computer language and a program for the mainframe that carries out resampling operations (including permutation tests, the bootstrap, and just about every other device) more expeditiously than simpler tools such as coins, dice, and random-number tables (Simon and Weidenfeld, 1974); this is the same language that today is marketed for the personal computer under the name Resampling Stats. More about the computer and resampling in the next chapter.

ON THE NATURE OF RESAMPLING TESTS

Resampling is a much simpler intellectual task than the formulaic method because simulation obviates the need to calculate the number of points in the entire sample space. This subject is explored in Chapter 00.

REPEAT 1000               Repeat 1000 trials
GENERATE 82 1,2 A         Generate randomly 82 1s (males) or 2s
COUNT A =1 B              Count the males
SCORE B Z                 Keep score of trial results
END
HISTOGRAM Z
COUNT Z >= 61 K           Count the trials with 61 or more "male" years

[Histogram of the trial results: frequency of the number of "male-predominant" years per trial, ranging from roughly 20 to 60.]

K = 0

Figure III-1-1: Test of the Arbuthnot Birth-Sex Data

NUMBERS (482 529 489 495 455 490 525 530 429 485 454 475 485 485 450 475 479 485 479 495 495 475 520 510 480 429) A    The 26 private-state prices, in cents
NUMBERS (465 455 411 415 420 455 380 400 419 475 474 450 410 400 505 420) B    The 16 monopoly-state prices, in cents
CONCAT A B C              Join the two vectors of data
REPEAT 1000               Repeat 1000 simulation trials
SHUFFLE C D               Shuffle the 42 state prices
TAKE D 1,26 E             Take 26 for the "private" group
TAKE D 27,42 F            Take the other 16 for the "monopoly" group
MEAN E EE                 Find the mean of the "private" group
MEAN F FF                 Mean of the "monopoly" group
SUBTRACT EE FF G          Difference in the means
SCORE G Z                 Keep score of the trials
END
HISTOGRAM Z               Graph of simulation results to compare with the observed result

[Histogram of the trial results: frequency of the difference in average prices (cents), ranging from roughly -40 to +40. Actual difference = $0.49.]

Figure III-1-2: Permutation Test of the Liquor-Price Difference

URN 31#1 574#2 men        An urn called "men" with 31 "1s" (= infarctions) and 574 "2s" (= no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men high       Sample (with replacement!) 135 of the numbers in this urn; give this group the name "high"
SAMPLE 470 men low        Same for a group of 470; call it "low"
COUNT high =1 a           Count infarctions in the first group
DIVIDE a 135 aa           Express as a proportion
COUNT low =1 b            Count infarctions in the second group
DIVIDE b 470 bb           Express as a proportion
SUBTRACT aa bb c          Find the difference in infarction rates
SCORE c z                 Keep score of this difference
END
HISTOGRAM z
COUNT z >=.029 k          How often was the resampled difference >= the observed difference?
DIVIDE k 1000 kk          Convert this result to a proportion
PRINT kk

[Histogram of the trial results: frequency of the difference between paired resamples (proportion with infarction), ranging from roughly -0.1 to +0.1.]

kk = 0.102 (the proportion of resample pairs with a difference >= .029)

Figure III-1-4: Test for Differences in Infarctions

URN 10#1 125#0 men        An urn (called "men") with ten "1s" (infarctions) and 125 "0s" (no infarction)
REPEAT 1000               Do 1000 trials
SAMPLE 135 men a          Sample (with replacement) 135 numbers from the urn and put them in "a"
COUNT a =1 b              Count the infarctions
DIVIDE b 135 c            Express as a proportion
SCORE c z                 Keep score of the result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of all trial results
PERCENTILE z (2.5 97.5) k Determine the 2.5th and 97.5th percentiles of all trial results; these points enclose 95% of the results
PRINT k

[Histogram of the trial results: frequency of the proportion with infarction, ranging from roughly 0 to 0.2.]

k = 0.037037 0.11852 (this is the 95% confidence interval, enclosing 95% of the resample results)

Figure III-1-5: Confidence Interval Around the Sample Proportion

URN 10#1 125#0 high       The universe of 135 high-cholesterol men, 10 of whom ("1s") have infarctions
URN 21#1 449#0 low        The universe of 470 low-cholesterol men, 21 of whom ("1s") have infarctions
REPEAT 1000               Repeat the steps that follow 1000 times
SAMPLE 135 high high$     Sample 135 (with replacement) from the high-cholesterol universe and put them in "high$" (the "$" suffix just indicates a resampled counterpart to the actual sample)
SAMPLE 470 low low$       Similarly for 470 from the low-cholesterol universe
COUNT high$ =1 a          Count the infarctions in the first resampled group
DIVIDE a 135 aa           Convert to a proportion
COUNT low$ =1 b           Count the infarctions in the second resampled group
DIVIDE b 470 bb           Convert to a proportion
DIVIDE aa bb c            Divide the proportions to calculate the relative risk
SCORE c z                 Keep score of this result
END                       End the trial, go back and repeat
HISTOGRAM z               Produce a histogram of trial results
PERCENTILE z (2.5 97.5) k Find the percentiles that bound 95% of the trial results
PRINT k

[Histogram of the trial results: frequency of the relative risk, ranging from roughly 0 to 6.]

Results (estimated 95% confidence interval): k = 0.68507 3.3944

Figure III-1-6: Confidence Interval Around Relative Risk

NUMBERS (6.9 7.6 7.3 7.6 6.8 7.2 8.0 5.5 5.8 7.3 8.2 6.9 6.8 5.7 8.6) treat      The treatment-group birthweights
NUMBERS (6.4 6.7 5.4 8.2 5.3 6.6 5.8 5.7 6.2 7.1 7.0 6.9 5.6 4.2 6.8) control    The control-group birthweights
CONCAT treat control all  Combine all birthweight observations in the same vector
REPEAT 1000               Do 1000 simulations
SAMPLE 15 all treat$      Take a resample of 15 from all birthweights (the "$" indicates a resampling counterpart to a real sample)
SAMPLE 15 all control$    Take a second, similar resample
MEAN treat$ mt            Find the means of the two resamples
MEAN control$ mc
SUBTRACT mt mc dif        Find the difference between the means of the two resamples
SCORE dif z               Keep score of the result
END                       End the simulation experiment, go back and repeat
HISTOGRAM z               Produce a histogram of the resample differences
COUNT z >= .82 k          How often did the resample differences exceed the observed difference of .82?

[Histogram of the trial results: frequency of the resample differences in pounds, ranging from roughly -1.5 to +1.5.]

Result: Only 1.3% of the pairs of resamples produced means that differed by as much as .82. We can conclude that the observed difference is unlikely to have occurred by chance.

Figure III-1-7: Test for Birthweight Differences

REPEAT 1000               Do 1000 simulations
GENERATE 31 (1 2 3) ranks Generate 31 numbers, each a "1", "2" or "3", to simulate random assignment of ranks 1-3 to the drug/therapy alternative
MEAN ranks rankmean       Take the mean of these 31 ranks
SCORE rankmean z          Keep score of the mean
END                       End the simulation, go back and repeat
HISTOGRAM z               Produce a histogram of the rank means
COUNT z <=1.74 k          How often is the mean rank as good as (<=) 1.74, the observed value?
PRINT k

[Histogram of the trial results: frequency of the mean rank per trial, ranging from roughly 1.4 to 2.6.]

Figure III-1-8: Test for Improvement by Combined Depression Therapy

Table III-1-1
Development of Myocardial Infarction in Framingham After 16 Years,
Men Age 35-44, by Level of Serum Cholesterol

Serum cholesterol (mg%)    Developed MI    Did not develop MI    Total
> 250                           10                125             135
<= 250                          21                449             470

Source: Shurtleff, D., The Framingham Study: An Epidemiologic Investigation of Cardiovascular Disease, Section 26. Washington, DC: U.S. Government Printing Office. Cited in Kahn and Sempos (1989), p. 61, Table 3-8.

Table III-1-2
Birthweights in a Clinical Trial to Test a Drug for Preventing Low Birthweights

                 Baby weight (lb)
Patient    Treatment group    Control group
   1             6.9               6.4
   2             7.6               6.7
   3             7.3               5.4
   4             7.6               8.2
   5             6.8               5.3
   6             7.2               6.6
   7             8.0               5.8
   8             5.5               5.7
   9             5.8               6.2
  10             7.3               7.1
  11             8.2               7.0
  12             6.9               6.9
  13             6.8               5.6
  14             5.7               4.2
  15             8.6               6.8

Source: Rosner, Table 8.7.

Table III-1-3
Observed Rank of Depression Treatments, by Effectiveness (Hypothetical)

                          Treatment
Triplet group     Drug    Therapy    Drug/Therapy
     1              3        1            2
     2              2        3            1
     3              1        3            2
     .              .        .            .
     .              .        .            .
     .              .        .            .
    31              2        1            3
Avg. rank          2.29     1.98         1.74

ENDNOTES

<1>: The following biostatistical examples are joint work with Peter C. Bruce.