CHAPTER III-4

WHAT "MONTY HALL" TEACHES ABOUT THE THEORY OF RESAMPLING: SOME DIFFICULT PROBLEMS, AND THE COMPUTATION OF SAMPLE SPACES

This chapter is a reworked version of the article: Simon, Julian L., "What Some Puzzling Problems Teach About the Theory of Simulation and the Use of Resampling," The American Statistician, November 1994, Vol. 48, No. 4, pp. 1-4. A draft follows.

WHAT SOME PUZZLING PROBLEMS TEACH ABOUT THE THEORY OF SIMULATION AND THE USE OF RESAMPLING

Julian L. Simon

One of the reasons given for preferring conventional methods over resampling methods is that formulaic methods develop "insight" (Kruskal, 1984). But though formulas may help some persons improve their understanding of the workings of a mathematical or real-world system, in other cases resampling methods can be more effective than conventional methods in developing insight into the mechanisms. This will be illustrated by analogy with some puzzling problems in probability which, though not themselves problems in statistical inference, show the power of physical simulation and thereby illuminate the benefits of the resampling process.

Furthermore, controlled experiments comparing the number of problems solved correctly with the two methods have shown that the resampling method has stronger roots in the intuition of most people than does the conventional method (Simon, Atkinson, and Shevokas, 1976). Indeed, it is reasonable to ask whether a problem which requires the Normal approximation to the binomial can be well understood intuitively even by well-trained mathematicians. (What is the meaning of pi in the formula for the Gaussian distribution?) But resampling simulation of such systems is usually obvious and easy, as discussed below.

Some terminology: "Simulation" here means any process of Monte Carlo experimental repetitions of a model of a probabilistic process.
"Resampling" refers to the sub-class of such simulations statistical problems done with simulations; it includes bootstrap and Fisherian permutation methods as well as simulation techniques for dealing with a variety of other problems such as binomial proportions. (See Simon, 1969; 3rd edition 1985 with Burstein, for a review of many such methods.) SOME PUZZLING PROBLEMS: DEDUCTION VERSUS SIMULATION The Three-Door Problem Consider the now-famous problem of the three doors, long known by statisticians but recently popularized in Parade as follows (vos Savant, 1990a): The player faces three closed containers, one containing a prize and two empty. After the player chooses, s/he is shown that one of the other two containers is empty. The player is now given the option of switching from her/his original choice to the other closed container. Should s/he do so? Answer: Switching doubles the chances of winning. The thousands of responses columnist Marilyn vos Savant received (1990b, 1991a, 1991b)--including a good many from Ph.D.'s in mathematics--show that logical mathematical deduction fails badly in this case. Most people--both laypersons and statisticians--arrive at the wrong answer. Simulation, however -- and hands-on simulation with physical symbols, rather than computer simulation -- is a surefire way of obtaining and displaying the correct solution. Table 1 shows such a simple simulation with a random-number table. Column 1 represents the box you choose, column 2 where the prize is. Based on columns 1 and 2, column 3 indicates the box that the "host" would now open and show to be empty. Lastly, column 4 scores whether the "switch" or "remain" strategy would be preferable. A count of the number of winning cases for "switch" and "remain" gives the result sought. Table 1 Unlike computer simulations, this process is transparent. Not only is the best choice obvious, but you are likely to understand quickly why switching is better. 
No other mode of explanation or solution brings out this intuition so well. And it is much the same with other problems in probability and statistics. Simulation can provide not only answers but also insight into why the process works as it does. In contrast, formulas produce obfuscation and confusion for most non-mathematicians.

Another way to elucidate the three-box problem is to list the elements of the sample space (all of which have the same probability), as shown in Table 2. But though this sample space is easy to understand after it is placed before you, constructing it requires much more depth of understanding of the problem's logic than does simulation -- which does not require that one count the possible outcomes or the number of "successes"; this may explain why it is much easier to err with sample-space analysis in this and other problems.

Table 2

The Beagle Problem

Ms. vos Savant also published this problem in Parade: A shopkeeper says she has two new baby beagles to show you, but she doesn't know whether they're male, female, or a pair. You tell her that you want only a male, and she telephones the fellow who's giving them a bath. "Is at least one a male?" she asks him. "Yes!" she informs you with a smile. What is the probability that the other one is [also] a male?

vos Savant gave the answer as one in three. Again PhDs wrote her to say -- with great confidence -- that she was all wrong. So let's see how simulation handles the problem.

Consider a two-digit column of random numbers in Table 3, using odd numbers for females and even for males. The first forty lines are sufficient to suggest the correct probability, and also to make clear the mechanism: Two-female pairs, a fourth of the cases, are excluded from the sample by the "yes" answer. Of the remaining pairs, two-thirds are mixed pairs -- for which the answer to "is the other also a male?" is "no" -- whereas the only pairs that give a "yes" answer, the two-male pairs, are only a third of the remaining pairs.
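The same odd/even digit scheme is easy to program; a Python sketch of the Table 3 simulation (the digit coding follows the text; everything else is mine):

```python
import random

rng = random.Random(0)
yes = no = 0
for _ in range(10000):
    # one random digit per pup: even = male, odd = female
    pair = [rng.randint(0, 9) for _ in range(2)]
    males = [d % 2 == 0 for d in pair]
    if not any(males):
        continue          # two females: ruled out by the "yes" answer on the phone
    if all(males):
        yes += 1          # the other pup is also a male
    else:
        no += 1           # a mixed pair: the other pup is a female
print(round(yes / (yes + no), 2))  # close to 1/3
```

About a quarter of the trials are discarded as two-female pairs; of those kept, roughly a third are two-male pairs, reproducing vos Savant's one-in-three answer.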
So once more simulation gets it right very quickly and easily, whereas the deductive method of mathematical logic results in much confusion.

Table 3

The "beagle" problem again illustrates the power of simulation, and it supports by analogy the general use of the resampling method in statistics, because it reveals the limitations of human deductive capacity, even among those who are mathematically adept.

The Aces Problem

The aces problem is particularly compelling because two eminent students of probability argued and disagreed about it right through the first edition of Warren Weaver's book, Lady Luck. From the second edition:

A [bridge] hand is dealt, and you state to the other three players, "I have an ace." What is the probability, as calculated, of course, by someone other than you (for you know what you have), that you also have another ace? This, says Mr. [Martin] Gardner, is 5359/14498, which is about 0.37. A few hands later you announce, "I have the Ace of Spades." What now is the probability that you also have another ace? Surprisingly enough, Mr. Gardner says, the answer now is 11686/20825, or about 0.56. Why does the knowledge of which ace affect the answer?

Mr. Gardner goes on to say that you can analyze this situation by considering the much simpler but similar problem of dealing a hand of two cards from a "deck" consisting of only four cards, namely the Ace of Spades, Ace of Hearts, Jack of Diamonds, and Two of Clubs. There are only 6 possible two-card hands from this deck. Five of them have at least one ace, and in only one of these five is there a second ace. Three of them contain the Ace of Spades, and one of these three contains a second ace. So in this illustration the two answers would be 1/5 and 1/3! (Weaver, 1963, p. 363)

This problem is sufficiently difficult that Weaver got it wrong in his first edition. However, you can quickly arrive at the answer by playing the game with a pair of random numbers.
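Gardner's four-card illustration can also be verified by direct enumeration; a Python sketch (the two-letter card labels are my own shorthand):

```python
from itertools import combinations

# the four-card "deck": AS = Ace of Spades, AH = Ace of Hearts; JD and 2C are non-aces
deck = ["AS", "AH", "JD", "2C"]
hands = list(combinations(deck, 2))            # the 6 possible two-card hands

with_any_ace = [h for h in hands if "AS" in h or "AH" in h]   # 5 hands
with_spade_ace = [h for h in hands if "AS" in h]              # 3 hands
both_aces = [h for h in hands if "AS" in h and "AH" in h]     # 1 hand

print(len(both_aces) / len(with_any_ace))    # 1/5
print(len(both_aces) / len(with_spade_ace))  # 1/3
```

Announcing the Ace of Spades shrinks the set of qualifying hands from five to three while the one two-ace hand survives, which is why the probability rises.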
Set "1" = ace of hearts, "2" = ace of spades, "3" and "4" = non-aces (and to save time, using comparable numbers "5-8" as well, and ignoring "9" and "0" and duplicate numbers; columns 1 and 2 in Table 4). Considering the first situation Weaver mentions (where you announce you have an ace), mark the cases "yes" or "no" (column 3 in Table 4). Now go back through the same set of random numbers but considering only those where you have a ace of spades (column 4). You immediately see that this excludes some of the cases you have previously marked "no" but none that you marked "yes". Hence the probability of a "yes" in the second case must be greater - simply because there are fewer cases that qualify to be either "yes" or "no". Table 4 The aces problem usually becomes clear as you simulate, though the logic is demonstrably difficult when done abstractly. However, the "bottom line" (except to professional mathematicians) is not whether you come to understand the mathe- matical logic but whether you can arrive at the right answer. And if you present to a group some problems like this and the preceding ones (or others such as the birthday problem), after you have shown people how to simulate the answer to one or two of the problems, it will be quite clear that they are more able to deal correctly with subsequent problems than if they only practice handling such problems deductively. Meeting Under the Clock Two persons agree to arrive at the town clock sometime tomorrow at some minute between 1PM and 2PM, and to remain for 20 minutes. What is the probability that they will be there at the same moment? Reasoning this out is not easy. But simulation with 60 pieces of paper in a hat is easy. (This problem is the canon of many queuing problems, of course. See, e.g., Singh, 1972, pp. 177-178.) The "Stop After A Boy" Rule of Family Formation Consider the behavioral rule of couples continuing to produce children until they have one boy. 
Many people's intuition will be that this rule will lead to a different sex ratio than another behavioral rule would. Logical argument may or may not be successful in persuading a person to the contrary. But simulating this rule by proceeding sequentially through the numbers in a random-number chart soon convinces all that the rule has no effect on the sex ratio.

The Game of 26

Back in the days before it was banned by the mayor, you could go into a Chicago bar and play the game of 26. The player picks a number from "1" to "6", and wins if that "point" comes up 26 or more times in 130 rolls of the dice (in practice, ten dice rolled thirteen times). As stated, this is a problem in pure probability. Or, if a given sample of results is observed, a similar situation can be framed as a canonical problem in statistics -- i.e., whether the observed sample is likely to have arisen by chance.

The calculations are sufficiently laborious that, in the days before personal computers, both John Scarne (author of the famous Scarne on Dice) and Martin Gardner (who for years wrote the mathematical games column for Scientific American) gave up on the job. The house usually paid off at 4 to 1, but Scarne could do no better than quote a gambling supply firm saying that "the percentage in favor of the operator is...17 per cent...but I have a hunch they're guessing." Despite the difficulty of formulaic calculation, by repeated trials the game operators obviously obtained reasonable estimates of the odds.

And the following short and intuitive computer program (in the language RESAMPLING STATS; a program in APL or MATHEMATICA would be similar) quickly gives the answer:

REPEAT 15000
   GENERATE 130 1,6 A    create 130 random numbers, "1"-"6"
   COUNT A =1 B          count the "points"
   SCORE B Z             keep track of the trial's result
END
COUNT Z >=26 K           count the number of results equal to or greater than 26

Result: Between 18 and 19 percent "success", not far from the bar odds.
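For readers without RESAMPLING STATS, an equivalent experiment can be sketched in Python (a translation of the program above, not Simon's own code):

```python
import random

rng = random.Random(0)
trials = 15000
wins = 0
for _ in range(trials):
    rolls = [rng.randint(1, 6) for _ in range(130)]   # 130 throws of a fair die
    if rolls.count(1) >= 26:                          # the player's "point" appears 26+ times
        wins += 1
print(round(wins / trials, 3))  # about 0.18
```

With a house payoff of 4 to 1 against a win probability near 0.18, the operator's edge works out to roughly the "17 per cent" the gambling supply firm guessed at.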
THE SECRET OF SIMULATION'S POWER

The intellectual advantage of simulation in general, and of the resampling method in statistics, is that though it takes repeated samples from the sample space, it does not require that one know the size of the sample space, or the number of points in a particular partition of it. To calculate the likelihood of getting (say) 26 "4"s in 130 dice throws with the binomial formula requires that one calculate the total number of possible permutations of 130 dice, and the number of those permutations that include 26 or more "4"s. Gaussian distribution-based methods often are used to approximate such processes when sample sizes become large, which introduces another escalation in the level of intellectual difficulty and obscurity for persons who are not skilled professional mathematicians. In contrast, with a simulation approach one needs to know only the conditions for producing a single "4" in one roll.

Indeed, it is the much lesser degree of intellectual difficulty which is the secret of simulation's success, because it improves the likelihood that the user will arrive at a sound solution to the problem at hand -- which I hope you will agree is the ultimate criterion.

One may wonder how simulation performs this economical trick of avoiding the complex abstraction of sample-space calculations. The explanation is that it substitutes the particular information about how elements in the sample are generated in that specific case, as derived from the facts of the actual circumstance; the analytic method does not use this information.

Recall that Galileo solved the problem of why three dice yield "10" and "11" more frequently than "9" and "12" by inventing the concept of the sample space. He listed all 216 possible permutations, and found that there are 27 permutations each that yield "10" and "11", but only 25 each that yield "9" and "12".
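Galileo's enumeration is now a one-minute computer exercise; a Python check of the counts just cited:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all ordered outcomes of three dice
counts = {s: sum(1 for r in rolls if sum(r) == s) for s in (9, 10, 11, 12)}
print(len(rolls))   # 216
print(counts)       # {9: 25, 10: 27, 11: 27, 12: 25}
```

The enumeration confirms 27 ordered permutations each for totals of 10 and 11 against 25 each for 9 and 12, the edge the gamblers had detected empirically.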
But it took a Galileo to perform this intellectual feat, simple as it seems to professionals now; lesser theorists had made the error of assuming that the probabilities of all four numbers are the same because the same number of combinations yields each of the four numbers.

That is, the gamblers before Galileo -- who from long experience had correctly determined that "10" and "11" had higher chances of success than "9" and "12" -- used, in effect, simulation: they relied on the facts that three fair dice are thrown with an equal chance of each outcome, and took advantage of the results of a series of such events, performed one at a time stochastically. In contrast, Galileo made no use of the actual stochastic element of the situation, and did not gain information from a sequence of such trials; instead, he replaced all possible sequences by a computation of their number (actually, a complete enumeration).

Simulation is not "just" a stochastic shortcut to the results of the formulaic method. Rather, it is a quite different route to the same endpoint, using different intellectual processes and utilizing different sorts of inputs. As a partial analogy, it is like fixing a particular fault in an automobile with the aid of a how-to-do-it manual's checklist, compared to writing a book about the engineering principles of the auto; the author may be no better at fixing cars than the hobbyist is at writing engineering books, but the hobbyist requires less time to learn the task to which he addresses him/herself -- fixing the car's fault -- than the author needs to learn to write learned works.

SUMMARY

Simulation -- and its sub-class of resampling methods in statistics -- is a much simpler task intellectually than the formulaic method of probabilistic calculation, because it does not require that one calculate a) the number of points in the entire sample space, and b) the number of points in some sub-set, so as to estimate the ratio of the latter to the former.
Instead, one directly estimates the ratio from a sample. Even slightly difficult problems involving permutations and combinations are sufficiently hard as to require advanced training. Similarly, it is much easier to sample the proportion of black grains of sand on a beach than it is to take a census of the total number of grains of each color; the latter is a task of great intellectual as well as great practical difficulty.

REFERENCES

Kruskal, correspondence, 1984.

Scarne, John, Scarne on Dice; xerox of page 356 sent to me with correspondence from Martin Gardner, November 16, 1991 or December 10, 1991. Book not available at most libraries.

Simon, J. L. (1969), Basic Research Methods in Social Science (New York: Random House); third edition (1985) with P. Burstein.

Simon, Julian L., Atkinson, David T., and Shevokas, Carolyn, "Probability and Statistics: Experimental Results of a Radically Different Teaching Method," American Mathematical Monthly, vol. 83, no. 9, Nov. 1976, pp. 733-739.

Singh, Jagjit, Great Ideas of Operations Research (New York: Dover, 1972).

vos Savant, Marilyn, "Ask Marilyn," Parade, September 9, 1990a, p. 15.

vos Savant, Marilyn, "Ask Marilyn," Parade, December 2, 1990b, p. 25.

vos Savant, Marilyn, "Ask Marilyn," Parade, February 17, 1991a, p. 12.

vos Savant, Marilyn, "Ask Marilyn," Parade, July 7, 1991b, p. 28.

vos Savant, Marilyn, letter to Martin Gardner, November 8, 1991; contained both the beagles problem and the letters she planned to print, including 5 from PhDs.

Weaver, Warren, Lady Luck: The Theory of Probability (New York: Anchor Books, 1963).

TABLE 2
SAMPLE SPACE ANALYSIS OF MONTY HALL PROBLEM

Location      Your      Host      Winning
of Prize      Choice    Opens     Strategy
   1            1       2 or 3    Remain
   1            2         3       Switch
   1            3         2       Switch
   2            1         3       Switch
   2            2       1 or 3    Remain
   2            3         1       Switch
   3            1         2       Switch
   3            2         1       Switch
   3            3       1 or 2    Remain

ABSTRACT

WHAT SOME PUZZLING PROBLEMS TEACH ABOUT THE THEORY OF SIMULATION AND THE USE OF RESAMPLING

Julian L. Simon

Simulation is simpler intellectually than the formulaic method because it does not require that one calculate a) the number of points in the entire sample space, and b) the number of points in some sub-set. Instead, one directly samples the ratio. This article presents probabilistic problems that confound even skilled statisticians when they attack the problems deductively, yet are easy to handle correctly, and become clear intuitively, with physical simulation. This analogy demonstrates the usefulness of simulation in the form of resampling methods.

AFTERNOTE TO CHAPTER III-4

AN EXTENDED PUZZLE EXAMPLE ILLUSTRATING BAYES' LAW

Consider this puzzle by a great logician: There are two balls in an urn, one white and the other either white or black. You draw a white and replace it. What is the chance that the next draw will be white?

Let us first note that this seems to be a problem in "statistics" rather than in "probability", in the sense that we are asking questions about the identity of an unknown generating universe, rather than about the behavior of a known generating universe. And it is a question in the tradition of Bayes' formula, which is one of the great and deep (though simple) mathematical and statistical ideas of all time.

It is easy to confuse two related cases: a) the probability that if the first ball is not replaced, the next ball drawn will be white, and b) the probability that the balls in the urn are both white, or one white (W) and one black (B). We will consider both of them.

First let us ask about the probability that the very first ball drawn from the urn will be W. One way to estimate this is to create two urns -- one with WW and the other with WB -- and draw randomly from the two with equal probability, an assumption that we now make explicit:

1. One urn with WW and the other with WB.
2. Draw randomly from the two urns with equal probability (by flipping a coin, say). Record the result as "B" or "W" ("0" or "1").
3.
Repeat (2) a hundred times.
4. Count the proportion "W".

After doing this simple experiment, it may be useful to draw out the explanation for the (close to) .75 proportion (probability) we observe. In half the cases the coin flip gives us the WW urn (whose identity we never observe directly), in which case we always draw and observe W; hence half the total observations are W from that source alone. In the other half of the flip-determined drawings we get the WB urn, and then (on average) half the observations are W and half B. So a total of (1/2 * 1) + (1/2 * 1/2) = .75 are W.

Next suppose we draw one counter and find it W. Now we ask the probability that the next counter from the same urn will be W. (Note how this differs from the puzzle which asks about the probability that both are white.) Whereas before we could specify the universe -- two urns, WW and WB -- now we can no longer specify the universe we are referring to; it is either WW or WB. This transition in thinking is subtle, and you could think about the matter some other way if you wish, but I believe that this transition is important because it brings in Bayes' formula for the first time, and marks the change from pure probability to a statistical outlook.

We can proceed experimentally:

3a. Following steps (1-2) above: If the counter drawn is "B", end the trial without any record and start again; otherwise proceed with this trial. (It is this fork in the path that makes the problem puzzling, because it happens almost unnoticed.)
4a. If "W" is drawn in step (3a), draw the other counter. Record "B" or "W".
5. Repeat (2-4) perhaps 100 times and count the proportion "W".

This gives the answer to the original problem for a total of two counters drawn (or, one after the original counter). We can also perform this experiment with a simple computer program. We estimate that the probability sought is about p = 2/3. (Notice that this procedure draws the second counter without replacing the first; if the first counter is replaced before the second draw, as in the original statement of the puzzle, the probability of a second W is 5/6.)
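A simple computer program for this experiment might look like the following in Python (a sketch of steps (3a)-(5) above, with the second counter drawn without replacing the first; the function and variable names are mine):

```python
import random

rng = random.Random(1)

def trial():
    urn = rng.choice([["W", "W"], ["W", "B"]])  # steps 1-2: pick an urn at random
    rng.shuffle(urn)
    first, second = urn
    if first == "B":
        return None        # step 3a: end the trial without any record
    return second          # step 4a: the other counter

results = [r for r in (trial() for _ in range(10000)) if r is not None]
print(round(results.count("W") / len(results), 2))  # close to 2/3
```

Roughly a quarter of the trials are discarded at the fork in step (3a); among the rest, "W" appears about two-thirds of the time.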
We can ask another related question at this point: What is the probability that the urn we are drawing from contains WW? This sounds more like a statistics or "inverse probability" question than before. Earlier we saw that a .5 chance of WW and a .5 chance of WB produces a .75 chance of the first counter being W. It is intuitively clear that after the first counter is drawn and found to be W, the chance is greater than .5 that the urn is WW. But how much greater?

One way to determine this probability is to draw again and again and again from the same urn to see if the result is always W or sometimes B (stopping after we get one B, of course, because that settles the question immediately). We can't draw forever, but we can certainly sneak up on forever by drawing (say) up to 25 times (or 100 if you like), stopping if we get B. So let's do that:

6. Keep drawing from the urn that gave us one W until we get a B (record it and stop), or until we reach 25 W's (record W).
7. Count the proportion W.

It is here that the Reverend Thomas Bayes gave us a formula to do the previous task in an expeditious way -- but a way that is very subtle and deep. He asks the question: How great is the probability that a WW urn will produce one W, compared to the probability that a WB urn will? The answer clearly is 1.0 versus .5, or twice as likely. So P = 1.0/(1.0 + .5) = 2/3. (Notice how this is different from the earlier .5 chance of either urn being picked.)

We then reverse field and say: If we observe a single W, there is twice as great a chance that it came from a WW urn as from a WB urn. This crucial and mind-bending reversal in direction of thought is Bayes' contribution. Notice also how we proceeded from asking the probabilistic question about the likelihoods of WW and WB urns producing a W, to asking the likelihood that if we get a W it came from a WW urn. So simple a change, but so confusing.
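Bayes' comparison of the two likelihoods can be written out in a few lines of Python (a sketch using the numbers above; the dictionary names are mine):

```python
# equal prior chances of the two urns
prior = {"WW": 0.5, "WB": 0.5}
# likelihood that each urn yields a W on one draw
like_w = {"WW": 1.0, "WB": 0.5}

joint = {u: prior[u] * like_w[u] for u in prior}   # .5 and .25
total = sum(joint.values())                        # .75, the chance of a first W
posterior = {u: joint[u] / total for u in joint}
print(posterior)                                   # WW: 2/3, WB: 1/3
```

The division by `total` is exactly the step of "removing from consideration" the quarter of cases in which the first counter is B.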
This is why statistics is such a difficult subject -- not because of complex mathematics, but because of the subtle changes in direction of thinking that are required (as well as because of the large number of issues that must be dealt with in modeling and making other judgments while proceeding from the original question to the final interpretation). As many times as I have worked with and taught Bayes' simple little Rule, I forget it and then have to think it through anew; I find it challenging every time.

Bayes reasoned (whereas we experimented above to get the answer) that if the chance of WW is twice that of WB (after getting one W), then we can calculate that two out of three times (on average) the urn that produces a W will be WW. Read this paragraph about a dozen times to see if you follow Bayes' logic; he proceeds from absolute probabilities for each urn -- 1.0 and .5 -- to relative probabilities. The relative probability for WW is 1.0/(1.0 + .5) = 2/3, and for WB it is .5/(1.0 + .5) = 1/3.

The heart of Bayes' invention is this shift from absolute to relative probabilities. And this can only be done by removing from consideration all those cases in which the first counter is B. That is, we ignore the .25 of all cases in which the first counter is B, and consider only the .75 of all cases in which it is W, finding that in .5/.75 of those cases the urn is WW, and in .25/.75 it is WB. Just as with the puzzle, it is this fork in the path -- at which we proceed by ignoring some of the cases -- that makes both the puzzle and Bayes' thinking confusing, I believe.

Let us digress to compare the situation at hand with the statistical procedure known as hypothesis testing. If someone claims to be able to increase the proportion of female calves upwards from .5, and an experiment produces 9 of 10 females, you "test the hypothesis" that the treatment works by asking the absolute probability that a .5 universe (or urn) will produce 9 of 10 females.
If that probability is low, you conclude that the sample did not come from that universe. But you do not conclude that the sample came from a .9 universe, even though that might be your best guess; rather, you conclude that the sample came from a universe greater than .5. So the "alternative hypothesis" is vague in this case -- "greater than .5" -- whereas in the white-black ball problem the alternatives are both sharp: either WW or WB. The use of Bayes' formula always requires sharply stated alternatives. It is that requirement that distinguishes it from hypothesis testing, and that gives it its power when the requirement is met (as it seldom is in science, except in genetics), though it is frequently met in medical diagnosis and sometimes in industrial quality control. The two modes of thinking also differ in that one never puts aside one set of cases in testing hypotheses.

A few paragraphs earlier I used the words "came from" in describing the connection between an observation and a universe. Nowhere else in statistical inference is it reasonable or justified to use this term, in my view; rather, it is only warranted to speak about the samples that a given universe "produces". The reason for using "came from" -- and Lewis Carroll (from whom I adapted this problem) uses those words, as do writers on Bayesian techniques -- is that the situation is framed extremely narrowly in situations to which we apply Bayes' Law. We allow there to be no other alternative source than those analyzed, and we allow no other possible outcomes than the ones permitted by the initial statement of the facts of the situation. By this sharp delineation, we convert the inference to a situation that is as close to deduction as is logically possible.
Hence there is no room for a logical or empirical break between the relative probabilities of two (or more) universes producing a particular sample and the probabilities that the sample came from the various stated universes. It is, of course, infrequent that situations in the world match these characteristics, but in some cases -- such as whether or not a person should be considered to have a given disease -- the application of Bayes' Law, and the use of the term "came from", may be quite warranted.

Let's now go a bit further with Bayes' Rule, making the probability of a WW urn equal .2 and that of a WB urn .8 -- the equivalent of one WW and four WB urns to be chosen from randomly. These are now explicitly "prior probabilities" for Bayes-type thinking. (The particular probabilities were chosen so they will not be confused with other probabilities in the problem.) It is easy to see, or to arrive at with experimentation, that before the first counter is drawn the probability of a W is .6; there is the equivalent of a total of six W's and four B's, each with an equal 1/5 * 1/2 chance of being chosen.

What about after a W is observed on the first draw? The probability that a W will be produced by a WW universe is 1/5 * 1 = .2, while the probability that a W will be produced by a WB universe is 4/5 * 1/2 = .4. So the probability that the observed W came from a WW universe is .2/(.2 + .4) = 1/3, according to Bayes' Rule. From this we can compute the probability that a W will be observed next time by substituting 1/3 and 2/3 as the "prior probabilities" for the two urns. (If each counter is replaced after it is drawn, we can continue in this fashion for any number of observed W's, each posterior serving as the prior for the next draw.)

One can get the desired probabilities directly by experimentation, without knowing Bayes' Law. The key modeling step is excluding a trial from consideration, without any record, if it produces a B before the sequence reaches the successive W about which you are presently inquiring.
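The arithmetic of this .2/.8 example can be checked with a few lines of Python (a sketch; the next-draw probability assumes the first counter is replaced before the second draw):

```python
prior = {"WW": 0.2, "WB": 0.8}     # one WW urn, four WB urns
like_w = {"WW": 1.0, "WB": 0.5}    # chance each urn type yields a W on one draw

# probability that the first counter is W
p_first_w = sum(prior[u] * like_w[u] for u in prior)              # .2 + .4 = .6

# Bayes' Rule: posterior chance of each urn, given one observed W
posterior = {u: prior[u] * like_w[u] / p_first_w for u in prior}  # WW: 1/3, WB: 2/3

# probability of a W on the next draw, using the posteriors as the new priors
p_next_w = sum(posterior[u] * like_w[u] for u in posterior)

print(round(p_first_w, 4), round(posterior["WW"], 4), round(p_next_w, 4))
```

The three printed values are .6, 1/3, and 2/3; repeating the last two lines with the new posteriors as priors carries the updating forward for further observed W's.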