CHAPTER III-4
WHAT "MONTY HALL" TEACHES ABOUT THE THEORY OF RESAMPLING:
SOME DIFFICULT PROBLEMS, AND THE COMPUTATION OF SAMPLE SPACES
This chapter is a reworked version of the article
Simon, Julian L., "What Some Puzzling Problems Teach About
the Theory of Simulation and the Use of Resampling," The American
Statistician, November 1994, Vol. 48, No. 4, pp. 1-4. A draft
follows.
WHAT SOME PUZZLING PROBLEMS TEACH
ABOUT THE THEORY OF SIMULATION AND THE USE OF RESAMPLING
Julian L. Simon
One of the reasons given for preferring conventional methods
over resampling methods is that formulaic methods develop
"insight" (Kruskal, 1984). But though formulas may help some
persons improve their understanding of the workings of a
mathematical or real-world system, in other cases resampling
methods can be more effective than conventional methods in
developing insight into the mechanisms. This will be illustrated
by analogy with some puzzling problems in probability which,
though not themselves problems in statistical inference, show the
power of physical simulation and thereby illuminate the benefits
of the resampling process.
Furthermore, that the resampling method has stronger roots
in the intuition of most people than does the conventional method
has been shown by controlled experiments which test the number of
problems solved correctly with the two methods (Simon, Atkinson,
and Shevokas, 1976).  Indeed, it is reasonable to ask whether the
solution of a problem which requires the Normal approximation to
the binomial can be well understood intuitively even by well-
trained mathematicians.  (What is the meaning of pi in the formula for
the Gaussian distribution?) But resampling simulation of such
systems is usually obvious and easy, as discussed below.
Some terminology: "Simulation" here means any process of
Monte Carlo experimental repetitions of a model of a
probabilistic process. "Resampling" refers to the sub-class of
such simulations that addresses statistical problems; it
includes bootstrap and Fisherian permutation methods as well as
simulation techniques for dealing with a variety of other
problems such as binomial proportions. (See Simon, 1969; 3rd
edition 1985 with Burstein, for a review of many such methods.)
SOME PUZZLING PROBLEMS: DEDUCTION VERSUS SIMULATION
The Three-Door Problem
Consider the now-famous problem of the three doors, long
known by statisticians but recently popularized in Parade as
follows (vos Savant, 1990a):
The player faces three closed containers, one containing a
prize and two empty. After the player chooses, s/he is shown
that one of the other two containers is empty. The player is now
given the option of switching from her/his original choice to the
other closed container. Should s/he do so? Answer: Switching
doubles the chances of winning.
The thousands of responses columnist Marilyn vos Savant
received (1990b, 1991a, 1991b)--including a good many from
Ph.D.'s in mathematics--show that logical mathematical deduction
fails badly in this case. Most people--both laypersons and
statisticians--arrive at the wrong answer.
Simulation, however -- and hands-on simulation with physical
symbols, rather than computer simulation -- is a surefire way of
obtaining and displaying the correct solution. Table 1 shows
such a simple simulation with a random-number table. Column 1
represents the box you choose, column 2 where the prize is.
Based on columns 1 and 2, column 3 indicates the box that the
"host" would now open and show to be empty. Lastly, column 4
scores whether the "switch" or "remain" strategy would be
preferable. A count of the number of winning cases for "switch"
and "remain" gives the result sought.
Table 1
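The hand simulation of Table 1 can also be run by machine.  Here
is an illustrative sketch in Python (the chapter's own programs
are in RESAMPLING STATS; any language would serve), with the
door numbers and trial count chosen arbitrarily:

```python
import random

def monty_trial(rng):
    """One play of the game; returns True if the 'switch' strategy wins."""
    doors = [1, 2, 3]
    prize = rng.choice(doors)           # where the prize is (column 2)
    choice = rng.choice(doors)          # the box you choose (column 1)
    # The host opens an empty box other than your choice (column 3).
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    # Switching means taking the one remaining closed box.
    switched = next(d for d in doors if d != choice and d != opened)
    return switched == prize

rng = random.Random(1)
trials = 100_000
switch_wins = sum(monty_trial(rng) for _ in range(trials))
print(switch_wins / trials)             # close to 2/3
```

Counting the winning cases shows roughly two wins in three for
"switch", double the one-in-three rate for "remain".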
Unlike computer simulations, this process is transparent.
Not only is the best choice obvious, but you are likely to
understand quickly why switching is better. No other mode of
explanation or solution brings out this intuition so well. And
it is much the same with other problems in probability and
statistics. Simulation can provide not only answers but also
insight into why the process works as it does. In contrast,
formulas produce obfuscation and confusion for most non-
mathematicians.
Another way to elucidate the three-box problem is to list
the elements of the sample set (all of which have the same
probability), as shown in Table 2. But though this sample space
is easy to understand after it is placed before you, constructing
it requires much more depth of understanding of the problem's
logic than does simulation - which does not require that one
count the possible outcomes or the number of "successes"; this
may explain why it is much easier to err with sample-space
analysis in this and other problems.
Table 2
The Beagle Problem
Ms. vos Savant also published this problem in Parade:
A shopkeeper says she has two new baby beagles to show you,
but she doesn't know whether they're male, female, or a
pair. You tell her that you want only a male, and she
telephones the fellow who's giving them a bath. "Is at
least one a male?" she asks him. "Yes!" she informs you
with a smile. What is the probability that the other one
is [also] a male?
vos Savant gave the answer as one in three. Again PhDs wrote
her to say - with great confidence - that she was all wrong. So
let's see how simulation handles the problem.
Consider a two-digit column of random numbers in Table 3,
using odd numbers for females and even for males. The first
forty lines are sufficient to suggest the correct probability,
and also to make clear the mechanism: Two-female pairs, a fourth
of the cases, are excluded from the sample. And mixed pairs -
which give a "no" answer - are two-thirds of the remaining pairs,
whereas the only pairs that give "yes" answers - two males - are
only a third of the remaining pairs. So once more simulation
gets it right very quickly and easily, whereas the deductive
method of mathematical logic results in much confusion.
Table 3
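The random-number-table procedure of Table 3 translates directly
into a short program.  A Python sketch, with even digits standing
for males as above:

```python
import random

rng = random.Random(2)
trials = 100_000
both_male = at_least_one_male = 0
for _ in range(trials):
    # Even digit = male, odd digit = female, as with the random-number table.
    pair = [rng.randint(0, 9) % 2 == 0 for _ in range(2)]   # True = male
    if any(pair):                  # "Is at least one a male?" -- "Yes!"
        at_least_one_male += 1
        if all(pair):
            both_male += 1
print(both_male / at_least_one_male)    # close to 1/3
```

The two-female pairs drop out of the count just as they drop out
of the shopkeeper's telephone report.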
The "beagle" problem again illustrates the power of
simulation, and it supports by analogy the general use of the
resampling method in statistics because it reveals the
limitations of human deductive capacity, even among those who are
mathematically adept.
The Aces Problem
The aces problem is particularly compelling because two
eminent students of probability argued and disagreed about it
right through the first edition of Warren Weaver's book, Lady
Luck. From the second edition:
A [bridge] hand is dealt, and you state to the
other three players, "I have an ace." What is the
probability, as calculated, of course, by someone other
than you (for you know what you have) that you also
have another ace? This, says Mr. [Martin] Gardner, is
5359/14498, which is about 0.37.
A few hands later you announce, "I have the Ace of
Spades." What now is the probability that you also
have another ace?
Surprisingly enough, Mr. Gardner says, the answer
now is 11686/20825, or about 0.56. Why does the
knowledge of which ace affect the answer?
Mr. Gardner goes on to say that you can analyze
this situation by considering the much simpler but
similar problem of dealing a hand of two cards from a
"deck" consisting of only four cards, namely the Ace of
Spades, Ace of Hearts, Jack of Diamonds, and two of
Clubs. There are only 6 possible two-card hands from
this deck. Five of them have at least one ace and in
only one of these five is there a second ace. Three of
them contain the Ace of Spades, and one of these three
contains a second ace. So in this illustration the two
answers would be 1/5 and 1/3! (Weaver, 1963, p. 363)
This problem is sufficiently difficult that Weaver got it
wrong in his first edition.  However, you can quickly arrive at
the answer by playing the game, selecting pairs of random
numbers.  Set "1" = ace of hearts, "2" = ace of spades, and "3"
and "4" = non-aces (to save time, use the comparable numbers "5"
through "8" as well, ignoring "9", "0", and duplicate numbers;
columns 1 and 2 in Table 4).
Considering the first situation Weaver mentions (where you
announce you have an ace), mark the cases "yes" or "no" (column 3
in Table 4). Now go back through the same set of random numbers
but considering only those where you have the ace of spades (column
4). You immediately see that this excludes some of the cases you
have previously marked "no" but none that you marked "yes".
Hence the probability of a "yes" in the second case must be
greater - simply because there are fewer cases that qualify to be
either "yes" or "no".
Table 4
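Gardner's four-card illustration can itself be simulated.  The
following Python sketch (the card labels are arbitrary) deals
two-card hands and tallies both conditional probabilities:

```python
import random

rng = random.Random(3)
deck = ["AS", "AH", "JD", "2C"]     # two aces and two non-aces
trials = 100_000
with_ace = with_ace_pair = with_as = with_as_pair = 0
for _ in range(trials):
    hand = rng.sample(deck, 2)                     # deal a two-card hand
    n_aces = sum(card.startswith("A") for card in hand)
    if n_aces >= 1:                                # "I have an ace"
        with_ace += 1
        if n_aces == 2:
            with_ace_pair += 1
    if "AS" in hand:                               # "I have the Ace of Spades"
        with_as += 1
        if n_aces == 2:
            with_as_pair += 1
print(with_ace_pair / with_ace)     # about 1/5
print(with_as_pair / with_as)       # about 1/3
```

Fewer hands qualify under the "Ace of Spades" announcement, but
none of the excluded hands contained a second ace, so the
proportion of "yes" cases rises.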
The aces problem usually becomes clear as you simulate,
though the logic is demonstrably difficult when done abstractly.
However, the "bottom line" (except for professional
mathematicians) is not whether you come to understand the
mathematical logic but whether you can arrive at the right answer.
And if you present to a group some problems like this and the
preceding ones (or others such as the birthday problem), after
you have shown people how to simulate the answer to one or two of
the problems, it will be quite clear that they are more able to
deal correctly with subsequent problems than if they only
practice handling such problems deductively.
Meeting Under the Clock
Two persons agree to arrive at the town clock sometime
tomorrow at some minute between 1PM and 2PM, and to remain for 20
minutes. What is the probability that they will be there at the
same moment? Reasoning this out is not easy. But simulation
with 60 pieces of paper in a hat is easy.
(This problem is the prototype of many queuing problems, of
course. See, e.g., Singh, 1972, pp. 177-178.)
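A computer version of the hat might look like this Python sketch,
which treats each arrival as a whole minute between 1 PM and 2 PM:

```python
import random

rng = random.Random(4)
trials = 100_000
meet = 0
for _ in range(trials):
    a = rng.randint(0, 59)      # first person's arrival minute after 1 PM
    b = rng.randint(0, 59)      # second person's arrival minute
    if abs(a - b) < 20:         # each remains for 20 minutes
        meet += 1
print(meet / trials)
```

With whole minutes the proportion comes out a little over .54;
treating the arrival times as continuous gives 5/9, about .56.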
The "Stop After A Boy" Rule of Family Formation
Consider the behavioral rule of couples continuing to
produce children until they have one boy. Many people's
intuition will be that this rule will lead to a different sex
ratio than another behavioral rule. Logical argument may or may
not be successful in persuading a person to the contrary. But
simulating this rule by proceeding sequentially through the
numbers in a random number chart soon convinces all that the rule
has no effect.
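The rule can be simulated in a few lines.  This Python sketch
follows a large number of hypothetical families, each stopping
after its first boy:

```python
import random

rng = random.Random(5)
families = 100_000
boys = girls = 0
for _ in range(families):
    # Keep having children until the first boy, then stop.
    while rng.random() < 0.5:    # a girl: record her and try again
        girls += 1
    boys += 1                    # the boy that ends the family
print(boys / (boys + girls))     # close to .5 despite the stopping rule
```

The proportion of boys among all children comes out close to one
half, because each birth is an independent even chance no matter
when the family chooses to stop.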
The Game of 26
Back in the days before it was banned by the mayor, you
could go into a Chicago bar and play the game of 26. Therein,
the player picks a number from "1" to "6", and wins if that
"point" comes up 26 or more times in 130 rolls of the dice (in
practice, ten dice rolled thirteen times).
As stated, this is a problem in pure probability. Or if a
given sample of results is observed, a similar situation can be
framed as a canonical problem in statistics -- i.e. whether the
observed sample is likely to have arisen by chance.
The calculations are sufficiently laborious that, in the
days before personal computers, both John Scarne (author of the
famous John Scarne on Dice) and Martin Gardner (who for years
wrote the mathematical games column for Scientific American) gave
up on the job. The house usually paid off at 4 to 1, but Scarne
could do no better than quote a gambling supply firm saying that
"the percentage in favor of the operator is...17 per cent...but I
have a hunch they're guessing."
Despite the difficulty of formulaic calculation, by repeated
trials the game operators obviously obtained reasonable estimates
of the odds.  And the following short and intuitive computer
program (written in the language RESAMPLING STATS; a program in
APL or MATHEMATICA would be similar) quickly gives the answer.
REPEAT 15000
GENERATE 130 1,6 A create 130 random numbers, "1-6"
COUNT A =1 B count the "points"
SCORE B Z keep track of the trial's result
END
COUNT Z >=26 K count the number of results equal to or greater than 26
Result: Between 18 and 19 percent "success", not far from
the bar odds.
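For comparison, the same 15,000-trial experiment rendered as a
Python sketch:

```python
import random

rng = random.Random(6)
trials = 15_000
wins = 0
for _ in range(trials):
    # 130 die rolls; count how many show the chosen "point", here "1".
    points = sum(rng.randint(1, 6) == 1 for _ in range(130))
    if points >= 26:
        wins += 1
print(wins / trials)     # roughly 18 to 19 percent
```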
THE SECRET OF SIMULATION'S POWER
The intellectual advantage of simulation in general, and of
the resampling method in statistics, is that though it takes
repeated samples from the sample space, it does not require that
one know the size of the sample space, or the number of points in
a particular partition of it. To calculate the likelihood of
getting (say) 26 "4"s in 130 dice throws with the binomial
formula requires that one calculate the total number of possible
permutations of 130 dice, and the number of those permutations
that include 26 or more "4"s. Gaussian distribution-based
methods often are used to approximate such processes when sample
sizes become large, which introduces another escalation in the
level of intellectual difficulty and obscurity for persons who
are not skilled professional mathematicians. In contrast, with a
simulation approach one needs to know only the conditions of
producing a single "4" in one roll.
Indeed, it is the much lesser degree of intellectual
difficulty which is the secret of simulation's success, because
it improves the likelihood that the user will arrive at a sound
solution to the problem at hand - which I hope that you will
agree is the ultimate criterion.
One may wonder how simulation performs this economical trick
of avoiding the complex abstraction of sample-space calculations.
The explanation is that it substitutes the particular information
about how elements in the sample are generated in that specific
case, as derived from the facts of the actual circumstance; the
analytic method does not use this information.
Recall that Galileo solved the problem of why three dice
yield "10" and "11" more frequently than "9" and "12" by
inventing the concept of the sample space.  He listed all 216
possible permutations, and found that there are 27 permutations
that yield "10" and "11", but only 25 that yield "9" and "12".
But it took a Galileo to perform this intellectual feat, simple
as it seems to professionals now; lesser theorists had made the
error of assuming that the probabilities of all four numbers are
the same because the same number of combinations yields all four
numbers.
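Galileo's enumeration is easy to reproduce by machine.  This
Python sketch tallies every ordered outcome of three dice:

```python
from itertools import product

# Tally every ordered outcome of three dice -- the full sample space.
counts = {total: 0 for total in range(3, 19)}
for dice in product(range(1, 7), repeat=3):
    counts[sum(dice)] += 1

print(sum(counts.values()))      # 216 permutations in all
print(counts[9], counts[12])     # 25 each
print(counts[10], counts[11])    # 27 each
```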
That is, the gamblers before Galileo - who from long
experience had correctly determined that "10" and "11" have
higher chances of success than "9" and "12" - were in effect
simulating: they used the assumed fact that three fair dice are
thrown, each outcome having an equal chance, and they took
advantage of the results of a series of such events, performed
one at a time stochastically.  In contrast, Galileo made no use
of the actual stochastic element of the situation, and gained no
information from a sequence of such trials; instead, he replaced
all possible sequences by computation of their number (actually,
a complete enumeration).
Simulation is not "just" a stochastic shortcut to the
results of the formulaic method. Rather, it is a quite different
route to the same endpoint, using different intellectual
processes and utilizing different sorts of inputs. As a partial
analogy, it is like fixing a particular fault in an automobile
with the aid of a how-to-do-it manual's checklist, compared to
writing a book about the engineering principles of the auto; the
author may be no better at fixing cars than the hobbyist is at
writing engineering books, but the hobbyist requires less time to
learn the task to which he addresses him/herself -- fixing the
car's fault -- than the author needs to learn to write learned
works.
SUMMARY
Simulation - and its sub-class of resampling problems in
statistics - is a much simpler task intellectually than the
formulaic method of probabilistic calculations because it does
not require that one calculate a) the number of points in the
entire sample space, and b) the number of points in some sub-set,
so as to estimate the ratio of the latter to the former.
Instead, one directly estimates the ratio from a sample. Even
slightly difficult problems involving permutations and
combinations are sufficiently difficult as to require advanced
training. Similarly, it is much easier to sample the proportion
of black grains of sand on a beach than it is to take a census of
the total number of grains of each color. The latter is a task
both of great intellectual difficulty as well as great practical
difficulty.
REFERENCES
Kruskal, correspondence, 1984.
Simon, J. L., Basic Research Methods in Social Science (New
York: Random House, 1969); third edition (1985) with P.
Burstein.
Scarne, John, Scarne on Dice; page 356 supplied in
correspondence from Martin Gardner, 1991.
vos Savant, Marilyn, letter to Martin Gardner, November 8,
1991; contained both the beagles problem and the letters she
planned to print, including five from PhDs.
vos Savant, Marilyn, "Ask Marilyn", Parade, September 9,
1990a, p. 15.
vos Savant, Marilyn, "Ask Marilyn", Parade, December 2,
1990b, p. 25.
vos Savant, Marilyn, "Ask Marilyn", Parade, February 17,
1991a, p. 12.
vos Savant, Marilyn, "Ask Marilyn", Parade, July 7, 1991b,
p. 28.
Simon, Julian L.; Atkinson, David T., and Shevokas, Carolyn,
"Probability and Statistics: Experimental Results of a Radically
Different Teaching Method", American Mathematical Monthly, vol.
83, no. 9, Nov. 1976, pp. 733-739.
Singh, Jagjit, Great Ideas of Operations Research (New
York: Dover, 1972).
Weaver, Warren, Lady Luck: The Theory of Probability (New
York: Anchor Books, 1963).
TABLE 2
SAMPLE SPACE ANALYSIS OF MONTY HALL PROBLEM

Location of    Your      Host      Winning
Prize          Choice    Opens     Strategy
    1             1      2 or 3     Remain
    1             2         3       Switch
    1             3         2       Switch
    2             1         3       Switch
    2             2      1 or 3     Remain
    2             3         1       Switch
    3             1         2       Switch
    3             2         1       Switch
    3             3      1 or 2     Remain
ABSTRACT
WHAT SOME PUZZLING PROBLEMS TEACH
ABOUT THE THEORY OF SIMULATION AND THE USE OF RESAMPLING
Julian L. Simon
Simulation is simpler intellectually than the formulaic
method because it does not require that one calculate a) the
number of points in the entire sample space, and b) the number of
points in some sub-set. Instead, one directly samples the ratio.
This article presents probabilistic problems that confound even
skilled statisticians when attacking the problems deductively,
yet are easy to handle correctly, and become clear intuitively,
with physical simulation.  This analogy demonstrates the
usefulness of simulation in the form of resampling methods.
AFTERNOTE TO CHAPTER III-4
AN EXTENDED PUZZLE EXAMPLE ILLUSTRATING BAYES' LAW
Consider this puzzle by a great logician. There are two
balls in an urn, one white and the other either white or black.
You draw a white and replace it. What is the chance that the next
draw will be white?
Let us first note that this seems to be a problem in
"statistics" rather than in "probability", in the sense that we
are asking questions about the identity of an unknown generating
universe, rather than about the behavior of a known generating
universe. And it is a question in the tradition of Bayes'
formula, which is one of the great and deep (though simple)
mathematical and statistical ideas of all time.
It is easy to confuse two related cases - a) the probability
that if the first ball is not replaced, the next ball drawn would
be white, and b) the probability that the balls in the urn are
both white, or one white (W) and one black (B). We will consider
both of them.
First let us ask about the probability that the very first
ball drawn from the urn will be W. One way to estimate this is
to create two urns - one with WW and the other with WB, and draw
randomly from the two with equal probability - an assumption that
we now make explicit.
1. One urn with WW and the other with WB.
2. Draw randomly from the two urns with equal probability
(by flipping a coin, say). Record the result as "B" or "W", "0"
or "1".
3. Repeat (2) a hundred times.
4. Count proportion "W".
After doing this simple experiment, it may be useful to draw
out the explanation for the (close to) .75 proportion
(probability) we observe. In half the cases the coin flip gives
us the WW urn (whose identity we never observe directly), in
which case we always draw and observe W; hence half the total
observations are W from that source alone. In the other half of
the flip-determined drawings we get the WB urn, and then (on
average) half the observations are W and half B. So a total of
(1/2 * 1) + (1/2 * 1/2) = .75 are W.
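Steps (1)-(4) can also be run as a Python sketch (the trial
count is arbitrary):

```python
import random

rng = random.Random(7)
trials = 100_000
white = 0
for _ in range(trials):
    # Step 2: a coin flip picks one of the two urns, then one counter is drawn.
    urn = ["W", "W"] if rng.random() < 0.5 else ["W", "B"]
    if rng.choice(urn) == "W":
        white += 1
print(white / trials)    # close to .75
```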
Next suppose we draw one counter and find it W. Now we ask
the probability that the next counter from the same urn will be
W. (Note how this differs from the puzzle which asks about the
probability that both are white). Whereas before we could specify
the universe - two urns, WW and WB - now we can no longer specify
the universe we are referring to; it is either WW or WB. This
transition in thinking is subtle, and you could think about the
matter some other way if you wish, but I believe that this
transition is important because it brings in Bayes' formula for
the first time, and marks the change from pure probability to a
statistical outlook.
We can proceed experimentally:
3a. Following steps (1-2) above: If the counter drawn is
"B", end the trial without any record and start again; otherwise
proceed with this trial. It is this fork in the path that makes
the problem puzzling because it happens almost unnoticed.
4a. If "W" is drawn in step (3a), draw another counter.
Record "B" or "W".
5.  Repeat (2)-(4a) perhaps 100 times and count the
proportion "W".  This gives the answer to the original problem
for a total of two counters drawn (or, one after the original
counter).
We can perform this experiment with a simple computer
program:
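A Python sketch of steps (3a)-(5), reading "another counter" as
the one remaining in the urn, might run as follows:

```python
import random

rng = random.Random(8)
trials = 100_000
kept = second_white = 0
for _ in range(trials):
    urn = ["W", "W"] if rng.random() < 0.5 else ["W", "B"]
    rng.shuffle(urn)
    first, second = urn
    if first == "B":         # step (3a): end the trial without any record
        continue
    kept += 1
    if second == "W":        # step (4a): draw the other counter
        second_white += 1
print(second_white / kept)   # close to 2/3
```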
We estimate that the probability sought is about p = 2/3.
We can ask another related question at this point: What is
the probability that the urn we are drawing from contains WW?
This sounds more like a statistics or "inverse probability"
question than before. Earlier we saw that a .5 chance of WW and a
.5 chance of WB produces a .75 chance of the first counter being
W. It is intuitively clear that after the first counter is drawn
and found to be W, the chance is greater than .5 that the urn is
WW. But how much greater?
One way to determine this probability is to draw again and
again and again from the same urn to see if the result is always
W or sometimes B (stopping after we get one B, of course, because
that settles the question immediately). We can't draw forever,
but we can certainly sneak up on forever by drawing (say) up to
25 times (or 100 if you like), stopping if we get B. So let's do
that:
6.  Keep drawing from the urn that gave us one W.  If we
get a B, record "B" and stop; if we reach 25 W's, record "W".
7. Count the proportion W.
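Steps (6)-(7) can be run as a Python sketch; drawing up to 25
more times (with replacement) all but guarantees that a WB urn
betrays itself with a B:

```python
import random

rng = random.Random(9)
trials = 100_000
recorded_w = kept = 0
for _ in range(trials):
    urn = ["W", "W"] if rng.random() < 0.5 else ["W", "B"]
    if rng.choice(urn) != "W":      # keep only urns that first gave us a W
        continue
    kept += 1
    # Step 6: up to 25 further draws (with replacement); a B ends in "B".
    if all(rng.choice(urn) == "W" for _ in range(25)):
        recorded_w += 1             # record "W" -- almost surely the WW urn
print(recorded_w / kept)            # close to 2/3
```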
It is here that the Reverend Thomas Bayes gave us a formula
to do the previous task in an expeditious way - but a way that is
very subtle and deep. He asks the question: How great is the
probability that a WW urn will produce one W, compared to the
probability of a WB urn? The answer clearly is 1.0 versus .5, or
twice as likely.  So P = 1.0/(1.0 + .5) = 2/3.  (Notice how this
is different from the earlier .5 chance that either urn will be
picked.)
We then reverse field and say: If we observe a single W,
there is twice as great a chance that it came from a WW urn than
from a WB urn. This crucial and mind-bending reversal in
direction of thought is Bayes' contribution.
Notice also how we proceeded from asking the probabilistic
question about the likelihoods of WW and WB urns to produce a W,
to asking the likelihood that if we get a W it came from a WW
urn. So simple a change, but so confusing. This is why
statistics is such a difficult subject - not because of complex
mathematics, but because of the subtle changes in direction of
thinking that are required (as well as because of the large
number of issues that must be dealt with in modeling and making
other judgments while proceeding from the original question to
the final interpretation).  As many times as I have worked with
and taught Bayes' simple little Rule, I forget it and then have
to think it through anew; I find it challenging every time.
Bayes reasoned (whereas we experimented above to get the
answer) that if the chance of WW is twice that of WB (after
getting one W), then we can calculate that two out of three times
(on average) the urn that produces a W will be WW. Read this
paragraph about a dozen times to see if you follow Bayes' logic;
he proceeds from absolute probabilities for each urn - 1.0 and .5
- to relative probabilities. The relative probability for WW is
1.0/(1.0 + .5) = 2/3, and for WB it is .5/(1.0 + .5) = 1/3. The
heart of Bayes' invention is this shift from absolute to relative
probabilities. And this can only be done by removing from
consideration all those cases which are initially B. That is, we
ignore the .25 of all cases in which the first counter is B, and
consider only the .75 of all cases which are W, finding that in
.5/.75 of the cases the urn is WW, and in .25/.75 it is WB. Just
as with the puzzle, it is this fork in the path at which we
proceed by ignoring some of the cases that makes both the puzzle
and Bayes' thinking confusing, I believe.
Let us digress to compare the situation at hand with the
statistical procedure known as hypothesis testing. If someone
claims to be able to increase the proportion of female calves
upwards from .5, and an experiment produces 9 of 10 females, you
"test the hypothesis" that the treatment works by asking the
absolute probability that a .5 universe (or urn) will produce 9
of 10 females. If that probability is low, you conclude that the
sample did not come from that universe. But you do not conclude
that the sample came from a .9 universe, even though that might
be your best guess; rather, you conclude that the sample came
from a universe greater than .5. So the "alternative hypothesis"
is vague in this case - "greater than .5" - whereas in the white-
black ball problem the alternatives are both sharp - either WW or
WB.
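The absolute probability mentioned here is a simple binomial
tail, which a couple of lines of Python can confirm:

```python
from math import comb

# Chance that a fair (p = .5) process yields 9 or more females in 10 births.
p_tail = sum(comb(10, k) for k in (9, 10)) / 2 ** 10
print(p_tail)    # 11/1024, about .011
```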
The use of Bayes' formula always requires sharply stated
alternatives. It is that requirement that distinguishes it from
hypothesis testing, and that gives it its power when that
requirement is met (as it seldom is in science except in
genetics), though it is frequently met with in medical diagnosis
and sometimes in industrial quality control. The two modes of
thinking also are different in that one never puts aside one set
of cases in testing hypotheses.
A few paragraphs earlier I used the words "came from" in
describing the connection between an observation and a universe.
Nowhere else in statistical inference is it reasonable or
justified to use this term, in my view; rather, it is only
warranted to speak about the samples that a given universe
"produces". The reason for using "came from" - and Lewis Carroll
(from whom I adapted this problem) uses those words, as well as
writers on Bayesian techniques - is that the situation is framed
extremely narrowly in situations to which we apply Bayes' Law.
We allow no alternative sources other than those analyzed, and
no possible outcomes other than the ones permitted by the
initial statement of the facts of the
situation. By this sharp delineation, we convert the inference
to a situation that is as close to deduction as is logically
possible. Hence there is no room for a logical or empirical
break between the relative probabilities of two (or more)
universes producing a particular sample and the probabilities
that the sample came from the various stated universes.  It
is, of course, infrequent that situations in the world match
these characteristics, but in some cases - such as whether or not
a person should be considered to have a given disease - the
application of Bayes' Law, and the use of the term "came from"
may be quite warranted.
Let's now go a bit further with Bayes' Rule, making the
probability of a WW urn equal .2 and that of a WB urn .8 - the
equivalent of one WW and four WB urns to be chosen from randomly.
These are now explicitly "prior probabilities" for Bayes-type
thinking. (The particular probabilities were chosen so they will
not be confused with other probabilities in the problem.)
It is easy to see, or to arrive at with experimentation,
that before the first counter is drawn the probability of a W is
.6; there are the equivalent of a total of six W and four B,
each with an equal 1/5 * 1/2 = 1/10 chance of being chosen.
What about after a W is
observed on the first draw?
The probability that a W will be produced by a WW universe
is 1/5 * 1 = .2, while the probability that a W will be produced
by a WB universe is 4/5 * 1/2 = .4. So the probability that the
observed W came from a WW universe is .2/(.2 + .4) = 1/3,
according to Bayes' Rule. From this we can compute the
probability that a W will be observed next time by substituting
1/3 and 2/3 as the "prior probabilities" for the two urns.  And
we can continue in this fashion for any number of observed W's.
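The arithmetic of this paragraph can be laid out in a few lines
of Python:

```python
# Bayes' Rule with unequal priors: P(WW) = .2, P(WB) = .8.
prior_ww, prior_wb = 0.2, 0.8
like_ww, like_wb = 1.0, 0.5      # chance that each urn yields a W on one draw

p_white = prior_ww * like_ww + prior_wb * like_wb    # .6 before any draw
post_ww = prior_ww * like_ww / p_white               # 1/3 after observing a W
post_wb = prior_wb * like_wb / p_white               # 2/3

# Substituting the posteriors as new priors for the next draw:
p_next_white = post_ww * like_ww + post_wb * like_wb
print(p_white, post_ww, p_next_white)
```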
One can get the desired probabilities directly by
experimentation without knowing Bayes' Law.  The key modeling
step is excluding a trial from consideration, without any record,
if it produces a B before the sequence reaches the successive W
about which you are presently inquiring.