CHAPTER III-5

UPDATING SUBJECTIVE PROBABILITIES WITH SIMULATION: FROM PEDAGOGY TO PRACTICE TO JEFFREY'S RULE TO PUZZLES

INTRODUCTION

The aim of this chapter is to show that simulation can be a helpful and illuminating way to approach problems in Bayesian analysis. Simulation has two valuable properties for Bayesian analysis: 1) It can provide an effective way to handle problems whose analytic solution may be difficult or impossible. 2) It can provide insight into problems that otherwise are difficult to understand fully, as is peculiarly the case with Bayesian analysis. The chapter therefore presents examples ranging from 1) the simplest pedagogy, to 2) the complexities of updating Bayesian probabilities in statistical practice, to 3) clarifying philosophical problems such as Jeffrey's Rule, and 4) the unmasking of a non-problem of Lewis Carroll.

Philosopher Charles Sanders Peirce is paraphrased as saying that "in no other branch of mathematics is it so easy for experts to blunder as in probability theory" (Gardner, 1961, p. 220).1 Even great mathematicians have blundered on simple problems, including D'Alembert and Leibniz. This observation is especially true of Bayesian problems, as much recent study in cognitive psychology has shown (for a summary, see Piattelli-Palmarini, 1994). When psychologists employ probability puzzles to show how people err, the puzzles almost invariably are problems in conditional probability and Bayesian analysis (and Feller [1968, p. 124] insists that Bayesian analysis be seen as an exercise in conditional probability). All but the simplest problems in conditional probability are confusing to the intuition even if not difficult mathematically. But when one tackles Bayesian and other problems in probability with experimental simulation methods rather than with logic alone, neither simple nor complex problems need be difficult for experts or beginners.

THE SIMPLEST PEDAGOGICAL PROBLEMS

To make clear the nature of Bayes' rule, let us start with the simplest sort of problem, and proceed gradually from there.

1. Assessing the Likelihood That a Used Car Will Be Sound.

Consider a problem in estimating the soundness of a used car one considers purchasing (after Wonnacott and Wonnacott, 1990, p. 93). Seventy percent of the cars are known to be OK on average, and 30 percent are faulty. Of the cars that are really OK, the mechanic identifies 80 percent as "OK" but says that 20 percent are "faulty"; of those that are really faulty, the mechanic correctly identifies 90 percent as "faulty" and says (incorrectly) that 10 percent are "OK". We wish to know the probability that if the mechanic says a car is "OK", it really is OK.

One can get the desired probabilities directly by simulation without knowing Bayes' Law, as we shall see. But one must be able to model the physical problem correctly in order to proceed with the simulation; this requirement of a clearly-visualized model is a strong point in favor of simulation.

The following steps determine the probability that a car said to be "OK" will turn out to be really faulty (the complement of the probability we seek):

1. Model in percentages the universe of all cars as an urn of 100 balls. Working from the data as given above, and referring to first principles, color (.9 * .3 =) 27 of the 100 balls (the faulty cars said to be "faulty") violet, (.1 * .3 =) 3 balls (faulty but said to be "OK") blue, (.2 * .7 =) 14 balls (OK cars said to be "faulty") orange, and (.8 * .7 =) 56 balls (said to be "OK" and really OK) maroon.
A Venn diagram may help with this step, but it is not necessary. An even better procedure would be to work directly from the original data. One would note, for example, that of 200 cars previously observed, 54 were faulty and were said to be "faulty", 6 were faulty but were said to be "OK", 28 were OK but were said to be "faulty", and 112 were OK and were said to be "OK". Then make an urn of 54 violet, 6 blue, 28 orange, and 112 maroon balls.1

2. Draw a ball. If it is one of those said to be "faulty" - that is, violet or orange - draw (with replacement) another ball. Continue until obtaining a ball said to be "OK" - that is, a blue or maroon ball. Then record its color.

3. Repeat step 2 perhaps 1000 times and compute the proportion of blues among the recorded results.

--------------------
1. The use of percentages rather than raw numbers in Bayesian problems is an unnecessary abstraction and is often misleading, in addition to being a hindrance in modeling for simulation. Indeed, thinking of the prior experience as a distribution of data rather than as a probability distribution is both closer to the facts and less confusing in complex situations, as will be seen in a later example.
--------------------

OR:

1. Choose a number randomly between "1" and "100".

2. If it is "28-86" (those said to be "OK"), record it; otherwise draw another number, and repeat until a number is recorded.

3. Repeat step 2 perhaps 1000 times and count the proportion "28-30" among the total recorded ("28-86").

The key modeling step is excluding a trial from consideration (without making any record of it) if it produces an irrelevant observation, and continuing to do so until the process produces an observation of the sort about which you are presently inquiring.

Using RESAMPLING STATS, an answer may be produced as follows:

"01 - 27"  = actually faulty, said to be "faulty"
"28 - 30"  = faulty, said to be "OK"
"31 - 86"  = OK, said to be "OK"
"87 - 100" = OK, said to be "faulty"

REPEAT 1000               do 1000 repetitions
GENERATE 1 1,100 a        generate a number between "1" and "100"
IF a between 28 86        if it's between "28" and "86" (those said to be "OK")
SCORE a z                 score this number
END                       end the IF condition
END                       end the REPEAT loop
COUNT z between 28 30 k   how many of the scored numbers were between "28" and "30" (faulty, "OK")
SIZE z s                  how many numbers were scored
DIVIDE k s kk             what proportion were faulty, "OK"
PRINT kk                  print the result

Result: kk = 0.039

2. Estimating Driving Risk for Insurance Purposes

Another sort of introductory problem, following Feller (1968, p. 22): A mutual insurance company charges its members according to the risk of having an auto accident. It is known that there are two classes of people - 80 percent of the population with good driving judgment and a probability of .06 of having an accident each year, and 20 percent with poor judgment and a probability of .6 of having an accident each year. The company's policy is to charge (in addition to a fee to cover overhead expenses) $100 for each percent of risk; i.e., a driver with a probability of .6 should pay 60*$100 = $6000. If nothing is known of a driver except that he had an accident last year, what fee should he pay?

This procedure will produce the answer:

1. Construct urn A with 6 red and 94 green balls, and urn B with 60 red and 40 green balls.

2. Randomly select an urn, with probability .8 for A and .2 for B, and record the urn chosen.

3. Select a ball at random from the chosen urn, and replace it. If the ball is green, return to step 2 and begin the trial again; if red, continue to step 4.

4. Select another ball from the urn chosen in step 2. If it is red, record "Y"; if green, record "N".

5. Repeat steps 2-4 perhaps 1000 times, and determine the proportion "Y". The proportion "Y" estimates the risk of an accident for a driver known to have had one last year; the exact conditional probability is (.8*.06*.06 + .2*.6*.6)/(.8*.06 + .2*.6) = .446, so the fee should be approximately 45*$100 = $4500.
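For readers who prefer a general-purpose language to RESAMPLING STATS, the same urn procedure can be sketched in a few lines of Python (a minimal illustration only; the urn counts are those of steps 1-4 above):

import random

URN_A = ["red"] * 6 + ["green"] * 94    # good drivers: accident probability .06
URN_B = ["red"] * 60 + ["green"] * 40   # poor drivers: accident probability .6

def one_trial():
    while True:
        # Step 2: choose an urn, probability .8 for A and .2 for B.
        urn = URN_A if random.random() < 0.8 else URN_B
        # Step 3: if the first ball is green, discard the trial and start over.
        if random.choice(urn) == "red":
            # Step 4: draw a second ball from the same urn; "red" is a "Y".
            return random.choice(urn) == "red"

trials = [one_trial() for _ in range(10000)]
print(sum(trials) / len(trials))        # proportion "Y": about .45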
3. Screening for Disease

This is a classic Bayesian problem, quoted by Tversky and Kahneman (1982, pp. 153-154) from Casscells, Schoenberger, and Grayboys (1978, p. 999):

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person's symptoms or signs?

Tversky and Kahneman note that among the respondents - students and staff at Harvard Medical School - "The most common response, given by almost half of the participants, was 95%" - very much the wrong answer.

To obtain an answer by simulation, we may rephrase the question above with (hypothetical) absolute numbers as follows:

If a test to detect a disease whose prevalence has been estimated to be about 100,000 in the population of 100 million persons over age 40 (that is, about 1 in a thousand) has been observed to have a false positive rate of 60 in 1200, and never gives a negative result if a person really has the disease, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person's symptoms or signs?

(If the raw numbers are not available, the problem can be phrased in such terms as "about 1 case in 1000" and "about 5 false positives in 100 cases".)

One may obtain an answer as follows:

1. Construct urn A with 999 white beads and 1 black bead, and urn B with 95 green beads and 5 red beads. A more complete problem that also discusses false negatives would need a third urn.

2. Pick a bead from urn A. If black, record "T", replace the bead, and end the trial. If white, continue to step 3.

3. If a white bead is drawn from urn A, select a bead from urn B. If red, record "F" and replace the bead; if green, record "N" and replace the bead.

4. Repeat steps 2-3 perhaps 10,000 times, and in the results count the proportion of "T"s to ("T"s plus "F"s), ignoring the "N"s.

Of course 10,000 draws would be tedious, but even after a few hundred draws a person would be likely to reach the correct conclusion that the proportion of "T"s to ("T"s plus "F"s) would be small. And it is easy with a computer to do 10,000 trials very quickly.
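A minimal Python sketch of the same procedure (an illustration only) shows how quickly a computer disposes of the 10,000 trials:

import random

def one_trial():
    # Step 2: urn A - 1 black bead (diseased) among 1000.
    if random.randint(1, 1000) == 1:
        return "T"                       # the test never misses a true case
    # Step 3: urn B - a well person gives a false positive 5 times in 100.
    return "F" if random.randint(1, 100) <= 5 else "N"

results = [one_trial() for _ in range(10000)]
t, f = results.count("T"), results.count("F")
print(t / (t + f))                       # about .02 - nowhere near 95%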
Note that the respondents in the Casscells et al. study were not naive; the medical staff members were supposed to understand statistics. Yet most doctors and other personnel offered wrong answers. If simulation can do better than the standard deductive method, then simulation would seem to be the method of choice. And only one piece of training for simulation is required: Teach the habit of saying "I'll simulate it" and then actually doing so.

FUNDAMENTAL PROBLEMS IN STATISTICAL PRACTICE

Box and Tiao begin their classic exposition of Bayesian statistics with the analysis of a famous problem first published by Fisher (1959):

...there are mice of two colors, black and brown. The black mice are of two genetic kinds, homozygotes (BB) and heterozygotes (Bb), and the brown mice are of one kind (bb). It is known from established genetic theory that the probabilities associated with offspring from various matings are as [in Table 1].

Suppose we have a "test" mouse which is black and has been produced by a mating between two (Bb) mice. Using the information in the last line of the table, it is seen that, in this case, the prior probabilities of the test mouse being homozygous (BB) and heterozygous (Bb) are precisely known, and are 1/3 and 2/3 respectively. Given this prior information, Fisher supposed that the test mouse was now mated with a brown mouse and produced (by way of data) seven black offspring. One can then calculate, as Fisher (1959, p. 17) did, the probabilities, posterior to the data, of the test mouse being homozygous (BB) and heterozygous (Bb) using Bayes' theorem... We see that, given the genetic characteristics of the offspring, the mating results of 7 black offspring changes our knowledge considerably about the test mouse being (BB) or (Bb), from a prior probability ratio of 2:1 in favor of (Bb) to a posterior ratio of 64:1 against it (1973, pp. 12-14).

TABLE 1
PROBABILITIES FOR GENETIC CHARACTER OF MICE OFFSPRING
_______________________________________________________________________
Mating                     BB (black)     Bb (black)     bb (brown)
_______________________________________________________________________
BB mated with bb               0              1               0
Bb mated with bb               0             1/2             1/2
Bb mated with Bb              1/4            1/2             1/4
_______________________________________________________________________
Source: Box and Tiao, 1973, pp. 12-14

1. Let us begin, as do Box and Tiao, by restricting our attention to the third line in Table 1, and let us represent those results with 4 balls - 1 black ball with "BB" painted on it, 2 black balls with "Bb" painted on them, and 1 brown ball, which we immediately throw away because we are told that the "test mouse" is black. The remaining 3 (black) balls are put into an urn labeled "test".

2. From prior knowledge we know that a BB black mouse mated with a bb brown mouse will produce all black mice (line 1 in the table), and a Bb black mouse mated with a bb brown mouse will produce 50 percent black mice and 50 percent brown mice (line 2). We therefore construct two more urns, one with a single black ball (the urn labeled "BB") and the other with one black ball and one brown ball (the urn labeled "Bb"). We now have three urns.

3. Take a ball from urn "test". If its label is "BB", record that fact, take a ball (the only ball, which is black) from the BB urn, record its color (we knew this already), and replace the ball into the BB urn; the overall record of this trial is "BB-black". If the ball drawn from urn "test" says "Bb", draw a ball from the Bb urn, record, and replace; the record will be either "Bb-black" or "Bb-brown".

4. Repeat step 3 seven times.

5. Examine whether the records of the seven balls drawn from the BB and Bb urns are all black; if so, record "Y", otherwise "N".

6. Repeat steps 3-5 perhaps 1000 times.

7. Ignore all "N" trials. Among the trials in which the result of step 5 is "Y", count the number of cases which are BB and the number which are Bb. The proportions of BB/"Y" and Bb/"Y" trials are the probabilities that the test mouse is BB and Bb respectively.

Creating the correct simulation procedure is not easy, because Bayesian reasoning is very subtle - a reason it has been the cause of much controversy for more than two centuries. But it certainly is not easier to create a correct procedure using analytic tools (except in the cookbook sense of plug in and pray).
And the difficult mathematics that underlie the analytic method (see e.g. Box and Tiao, Appendix A1.1) make it almost impossible for the statistician to fully understand the procedure from beginning to end; if one is interested in insight, the simulation procedure might well be preferred.2

A computer program to speed the above steps appears in the Appendix. The result found with a set of 1000 repetitions is .987. (The exact posterior probability of BB is 64/65 = .985: the prior odds of 1:2 for BB against Bb, multiplied by the likelihood ratio of 128:1 for seven black offspring, give posterior odds of 64:1.)

PROBLEMS BASED ON NORMAL AND OTHER DISTRIBUTIONS1

Much of the work in Bayesian analysis for scientific purposes treats the combining of prior distributions having Normal and other standard shapes with sample evidence which may also be represented with such standard functions. The mathematics involved often is formidable, though some of the calculational formulae are fairly simple and even intuitive. These problems may be handled with simulation by replacing the Normal (or other) distribution with the original raw data when data are available, or by a set of discrete sub-universes when distributions are subjective.

--------------------
1. This section represents work done jointly with Ekaterina Kamushadze and Peter C. Bruce.
--------------------

Measured data from a continuous distribution present a special problem, because the probability of any one observed value is very low, often approaching zero, and hence the probability of a given set of observed values usually cannot be estimated sensibly; this is the reason for the conventional practice of working with the continuous distribution itself, of course. But a simulation necessarily works with discrete values. A feasible procedure must bridge this gulf.

The logic for a problem of Schlaifer's will only be sketched out here, to be described at length in another publication. The procedure is rather novel, but it has not heretofore been published and therefore must be considered tentative and requiring particular scrutiny.

An Intermediate Problem in Conditional Probability

Schlaifer employs a quality-control problem for his leading example of Bayesian estimation with Normal sampling. A chemical manufacturer wants to estimate the yield of a crucial ingredient X in a batch of raw material in order to decide whether it should receive special handling. The yield ranges between 2 and 3 pounds (per gallon), and the manufacturer has compiled the distribution of the last 100 batches. The manufacturer currently uses the decision rule that if the mean of nine samples from the batch (which vary only because of measurement error, which is the reason he takes nine samples rather than just one) indicates that the batch mean is greater than 2.5 pounds, the batch is accepted.

The first question Schlaifer asks, as a sampling-theory waystation to the more general question, is the likelihood that a given batch with any given yield - say 2.3 pounds - will produce a set of samples with a mean as great as or greater than 2.5 pounds.

We are told that the manufacturer has in hand nine samples from a given batch; they are 1.84, 1.75, 1.39, 1.65, 3.53, 1.03, 2.73, 2.86, and 1.96, with a mean of 2.08. Because we are also told that the manufacturer considers the extent of sample variation to be the same at all yield levels, we may - if we are again working with 2.3 as our example of a possible universe - add (2.3 - 2.08 =) .22 to each of these nine observations, so as to constitute a bootstrap-type universe; we do this on the grounds that this is our best guess about the constitution of a distribution with a mean at (say) 2.3. We then repeatedly draw samples of nine observations from this shifted distribution to see how frequently its sample mean is 2.5 or greater. The work is so straightforward that we need hardly state the steps in the procedure.
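A minimal Python sketch of this first sampling experiment, using the nine observations given above (illustrative only):

import random

samples = [1.84, 1.75, 1.39, 1.65, 3.53, 1.03, 2.73, 2.86, 1.96]
xbar = sum(samples) / len(samples)               # 2.08

# Bootstrap-type universe for a supposed batch yield of 2.3.
universe = [x + (2.3 - xbar) for x in samples]

N = 10000
hits = sum(1 for _ in range(N)
           if sum(random.choice(universe) for _ in range(9)) / 9 >= 2.5)
print(hits / N)   # how often a 2.3-yield batch shows a sample mean of 2.5 or more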
Estimating the Posterior Distribution

Next we estimate the posterior distribution. Figure 1 shows the prior distribution of batch yields, based on 100 previous batches.

Figure 1

Notation: Sm = the set of batches (out of the total of S = 100) with a particular mean m (e.g. m = 2.1). xi = a particular observation (e.g. 1.03). s = the set of the xi.

We now perform, for each of the Sm (categorized into the tenth-of-pound divisions between 2.1 and 3.0), the same sort of sampling operation performed for Sm = 2.3 in the previous problem. But now, instead of having regard to the manufacturer's decision criterion of 2.5, we construct an interval of arbitrary width around the sample mean of 2.08 - say .1 wide, from 2.03 to 2.13 - and then work with the weighted proportions of sample means that fall into this interval.

1. Using a bootstrap-like approach, we presume that the sub-universe of observations related to each Sm consists of the nine xi, each shifted by the difference between m and the mean of the xi (2.08), on the grounds that this is our best guess about the constitution of a distribution with mean m. For m = 2.1, each xi is increased by .02, so that, e.g., 1.03 becomes 1.05.

2. Working with the distribution centered at 2.3 as an example: constitute a universe of the values (1.84+.22=2.06, 1.75+.22=1.97, ...). Here we may notice that the variability in the sample enters into the analysis at this point, rather than when the sample evidence is combined with the prior distribution; this is in contrast to conventional Bayesian practice, where the posterior is the result of the prior and sample means weighted by the reciprocals of the variances (see e.g. Box and Tiao, 1973, p. 17 and Appendix A1.1).

3. Draw nine observations from this universe (with replacement, of course), compute the mean, and record it.

4. Repeat step 3 perhaps 1000 times and plot the distribution of outcomes.

5. Compute the percentage of the means within (say) .05 on each side of the sample mean, i.e. from 2.03 to 2.13. The resulting number - call it UPi - is the un-standardized (un-normalized) effect of this sub-distribution in the posterior distribution.

6. Repeat steps 1-5 to cover each other possible batch yield from 2.1 to 3.0 (2.3 was just done).

7. Weight each of these sub-distributions - actually, its UPi - by its prior probability, and call the result WPi.

8. Standardize the WPi's to a total probability of 1.0. The result is the posterior distribution.
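Here is a Python sketch of steps 1-8. Because Figure 1 is not reproduced numerically in this text, the prior counts in the program are hypothetical stand-ins (they sum to 100); with the manufacturer's actual 100-batch counts substituted, the program carries out the procedure as described:

import random

samples = [1.84, 1.75, 1.39, 1.65, 3.53, 1.03, 2.73, 2.86, 1.96]
xbar = sum(samples) / len(samples)                  # 2.08

# HYPOTHETICAL prior counts standing in for Figure 1 (they sum to 100 batches).
prior = {2.1: 5, 2.2: 10, 2.3: 15, 2.4: 20, 2.5: 20,
         2.6: 15, 2.7: 8, 2.8: 4, 2.9: 2, 3.0: 1}

weights = {}
for m, count in prior.items():
    universe = [x + (m - xbar) for x in samples]    # steps 1-2: shifted universe
    hits = 0
    for _ in range(1000):                           # steps 3-4
        mean = sum(random.choice(universe) for _ in range(9)) / 9
        if abs(mean - xbar) <= 0.05:                # step 5: interval 2.03-2.13
            hits += 1
    weights[m] = (hits / 1000) * (count / 100)      # steps 6-7: the WPi

total = sum(weights.values())                       # step 8: standardize
posterior = {m: w / total for m, w in weights.items()}
print(posterior)                                    # the posterior distribution
print(sum(m * p for m, p in posterior.items()))     # its mean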
The value found is 2.283, which the reader may wish to compare with a theoretically-obtained result (which Schlaifer does not give).

This procedure must be biased, because the numbers of "hits" will differ between the two sides of the mean for all sub-distributions except the one centered at the same point as the sample; but the extent and properties of this bias are as yet unknown. The bias would seem to be smaller as the interval is made smaller, but a small interval requires a large number of simulations; a satisfactorily narrow interval surely will contain relatively few trials, which is a practical problem of still-unknown dimensions.

Another procedure - less theoretically justified and probably more biased - intended to get around the problem of the narrowness of the interval, replaces step 5 as follows:

5a. Compute the percentages of the means on each side of the sample mean, and note the smaller of the two (or, in another possible process, the difference of the two). The resulting number - call it UPi - is the un-standardized (un-normalized) weight of this sub-distribution in the posterior distribution.

Another possible criterion - a variation on the procedure in 5a - is the difference between the two tails; for a universe with the same mean as the sample, this difference would be zero.

The subject of this section has only been touched on here for lack of space, but more such problems, along with facilitating computer programs, are available upon request.

SOLVING BAYESIAN PROBABILITY PUZZLES WITH SIMULATION

Several illustrative puzzles at whose heart is conditional probability - including the famous Monty Hall "Let's Make a Deal" three-door problem - appeared earlier in this journal (Simon, 1994). Now let us consider a problem that Piattelli-Palmarini (1994) considers a canonical "illusion" in probability; this time it will not only be dealt with by simulation, but the psychological difficulty of solving the problem analytically will also be set forth.

Here is Samuel Goldberg's version of the problem that Joseph Bertrand posed late in the 19th century:

Three identical boxes each contain two coins. In one box both are pennies, in the second both are nickels, and in the third there is one penny and one nickel. A man chooses a box at random and takes out a coin. If the coin is a penny, what is the probability that the other coin in the box is also a penny?

Another way to phrase the same problem - with more dramatic detail, which apparently makes such problems more difficult:

A Spanish treasure fleet of three ships was sunk at sea off Mexico in the 1500s. One ship had a trunk of gold forward and another aft, another ship had a trunk of gold forward and a trunk of silver aft, while a third ship had a trunk of silver forward and another trunk of silver aft. A scuba diver just found one of the ships and a trunk of gold in it, but she ran out of air before she could check the other trunk. On deck, they are now taking bets about whether the other trunk on the same ship will contain silver or gold. What are fair odds that the trunk will contain gold?

These are the steps in a simulation that would answer the question:

1. Create three urns, each containing two balls, labeled "0,0", "0,1", and "1,1" respectively (where "0" stands for gold and "1" for silver).

2. Choose an urn at random, and shuffle its contents.

3. Choose the first element in the chosen urn's vector. If it is "1", stop the trial and make no further record. If it is "0", continue.

4. Record the second element in the chosen urn's vector on the scoreboard.

5. Repeat steps 2-4 perhaps 1000 times, and calculate the proportion of "0"s on the scoreboard.

Though an analogous computer simulation is shown in the Appendix, what makes this problem interesting is not the comparison of computer simulation to the formulaic approach, but rather the comparison of any simulation to ratiocination without calculation. The reason pure thought alone so often leads to the wrong answer is that this deceptively-simple problem really is quite complex, requiring many twists and turns. These are the logical steps one may distinguish in arriving at a correct answer with deductive logic (portrayed in Figure 2):

Figure 2

1. Postulate three ships - ship I with two gold chests (G-G), ship II with one gold and one silver chest (G-S), and ship III with two silver chests (S-S). (Choosing notation might well be considered one or more additional steps.)
2. Assert equal probabilities of each ship being found.

3. Step 2 implies equal probabilities of being found for each of the six chests.

4. Fact: the diver finds a chest of gold.

5. Step 4 implies that ship III (S-S) was not found; hence remove it from subsequent analysis.

6. Three possibilities remain: 6a) the diver found chest I-Ga, 6b) the diver found chest I-Gb, 6c) the diver found chest II-Gc. From step 3, the cases 6a, 6b, and 6c have equal probabilities.

7. If possibility 6a is the case, then the other chest is I-Gb; the comparable statements for cases 6b and 6c are I-Ga and II-S.

8. From steps 6 and 7: given equal probabilities of the three cases, and no other possible outcome, p(6a) = 1/3, p(6b) = 1/3, p(6c) = 1/3.

9. So p(G) = p(6a) + p(6b) = 1/3 + 1/3 = 2/3.

A key implication of the deservedly-famous research on errors in probabilistic judgments by Daniel Kahneman and Amos Tversky (interchangeably, Tversky and Kahneman) is that human thinking is often unsound. And some writers in their school of thought assert that the unsoundness of thinking is hard-wired into our brains; this point of view is expressed vividly in the title of Massimo Piattelli-Palmarini's book Inevitable Illusions; he calls the unsoundness "bias", and says that "we are instinctively very poor evaluators of probability" (1994, p. 3, italics in original).

Another possibility - not necessarily inconsistent with the genetic explanation - is that the reason we arrive at unsound answers to certain types of problems is that the problems are inherently very difficult, especially when they are tackled without the assistance of tools, because the problems require many steps and because the steps often involve reversals in the path. Without the aid of memory aids such as paper and pencil, and the skill of using them well, the problems are just too difficult for most persons.

One piece of evidence against the genetic-bias explanation is that the wrong answers to problems are not all the same; they do not even concentrate at one end of the probability spectrum. As the work of Kahneman and Tversky amply shows, the errors often are widely distributed among most or all of the simple arithmetical combinations of the numbers involved in the problems. The outstanding characteristic of the answers is that they are wrong, not the nature of the errors. In following long chains of logic and assessing complex assortments of information, our brains may be weaker than we would like, but we need not think of our brains as twisted.

The two explanations have quite different implications for remediation, and two different remedies are offered; I suggest resorting to simulation, whereas others suggest additional training (especially in probability theory) to improve people's logic. The different remedies are not necessarily connected to the two explanations, however; I believe that the remedy I suggest is implied by the bias explanation as well as by the weakness explanation.

SIMULATING PHILOSOPHICALLY-DIFFICULT BAYESIAN PROBLEMS

Another role for simulation in a Bayesian context is penetrating problems that are difficult technically or philosophically. This section presents two such examples.

1. Is Jeffrey's Rule of Any Use?

Jeffrey's Rule is a system for updating subjective probabilities in light of additional information when the probabilities have not been previously quantified. Box and Tiao (1973, pp. 41-46) give a classic exposition, but perhaps the best way to understand the system is from examples - an implicit operational definition.
Diaconis and Zabell (in Bell et al., 1988, p. 271) give as an example this problem from Whitworth (1901, pp. 167-68, Question 138):

A, B, C were entered for a race, and their respective chances of winning were estimated at 2/11, 4/11, 5/11. But circumstances come to our knowledge in favour of A, which raise his chance to 1/2; what are now the chances in favour of B and C respectively?

Answer. A could lose in two ways, viz. either by B winning or by C winning, and the respective chances of his losing in these ways were a priori 4/11 and 5/11, and the chance of his losing at all was 9/11. But after our accession of knowledge the chance of his losing at all becomes 1/2, that is, it becomes diminished in the ratio of 18:11. Hence the chance of either way in which he might lose is diminished in the same ratio. Therefore the chance of B winning is now 4/11 x 11/18, or 4/18; and of C winning 5/11 x 11/18, or 5/18. These are therefore the required chances.

This problem persuades Diaconis and Zabell of the sometime incapacity of Bayes' rule. Yet a simulation solution to the problem at hand seems straightforward:

1. Put 2 Amber, 4 Black, and 5 Claret balls in an urn, for the original probabilities of A, B, and C respectively.

2. We wish to raise A's probability from 2/11 to 1/2, and then to find the new probabilities of B and C without changing the relative probabilities of B and C. The necessity of making this latter assumption (or some other; I simply follow Whitworth in the restriction he places, rather than choosing one of my own) is forced upon our attention when we consider changing the composition of the balls in the urn, and this forcing of attention to a key issue is one of the greatest benefits of a simulation approach. We therefore add 7 Amber balls to the urn to make A's probability 9/18 = 1/2. If this is not crystal-clear intuitively, we can write formally (though the formalism is unnecessary to the main line of the discussion): A/T = 2/11, with T = 11; then A'/(T + x) = (2 + x)/(11 + x) = 1/2, where a prime on a variable means its value ex post the change; solving gives x = 7.

3. Estimate the new probabilities of B and C by repeated trials [though of course one could also calculate B/(T + x) = 4/18 and C/(T + x) = 5/18 to get p(B') and p(C')].
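Carried out in Python, the whole procedure - including the check in step 3 - is only a few lines (a sketch, not part of the chapter's RESAMPLING STATS programs):

import random

# Step 1's urn of 2 Amber, 4 Black, and 5 Claret balls,
# plus the 7 Amber balls added in step 2.
urn = ["A"] * (2 + 7) + ["B"] * 4 + ["C"] * 5

N = 10000
draws = [random.choice(urn) for _ in range(N)]
for entrant in "ABC":
    print(entrant, draws.count(entrant) / N)   # about 9/18, 4/18, 5/18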
The direct solution with simulation suggests that there is no need for Jeffrey's or any other subtle analysis in this case. One might reply that the simulation illustrates Jeffrey's approach. But if the simulation method brings one directly to the solution without need for Jeffrey's analysis, what is the benefit of Jeffrey's analysis in this case?

The point here is not to deny that the discussion by Diaconis and Zabell of the difficulties they see in the problem throws light on interesting philosophical and theoretical issues. And Jeffrey's Rule may be valuable in other contexts, though apparently it is not necessary here; identifying those contexts would be of interest. The point, rather, is to give one more instance of how simulation can often (even if not always) give simple and understandable solutions to apparently-difficult problems.

Now consider the other numerical example presented in the same article by Diaconis and Zabell (p. 272):

Suppose that in a criminal case we are trying to decide which of four defendants, called a, b, c, d, is a thief. We initially think P(a)=P(b)=P(c)=P(d)=1/4. Evidence is then introduced to show that the thief was probably left-handed. The evidence does not demonstrate that the thief was definitely left-handed, but it leads us to conclude that P(thief left-handed) = .8. If a and b are the defendants who are left-handed, then E1={a,b}, E2={c,d} and PH(E1)=.8, PH(E2)=.2 [where H stands for the probability in light of handedness]. If the only effect of the evidence was to alter the probability of left-handedness - in the sense that P(A|Ei) = PH(A|Ei) - then PH is obtained from Jeffrey's rule as PH(a)=.4, PH(b)=.4, PH(c)=.1, PH(d)=.1. Evidence is next presented that it is somewhat likely that the thief was a woman. If the female defendants are a and c, then F1={a,c}, F2={b,d}. If PHS(F1)=.7 [where S stands for the probability in light of sex] and Jeffrey-updating is again judged acceptable, then PHS(a)=.56, PHS(b)=.24, PHS(c)=.14, PHS(d)=.06. If instead the evidence (F1,.7), (F2,.3) is presented first and (E1,.8), (E2,.2) is presented second, is PSH equal to PHS? [This example] shows that in general the order matters, since the currently held opinion governs; in this example the reader may check that the order does not matter.

Now a simulation solution:

1. Put a total of 4 balls marked a, b, c, and d into an urn, for the original state of belief.

2. On the assumption - forced by our decision about which balls to add to the urn (unless we explicitly choose to make some other assumption) - that the relative probabilities a:b and c:d remain the same, add 3 a's and 3 b's to the urn, to make p(a + b) = .8.

3. Find the new probabilities of a, b, c, and d by experiment (or by examination of the proportions of balls in the urn).

4. Continuing with the evidence relevant to the sex of the thief: on the assumption of constant relative probabilities a:c and b:d, and by the same logic, add balls to make 56 a's, 24 b's, 14 c's, and 6 d's, which immediately produces the probabilities for each suspect.
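The two updates can be checked with a short Python sketch, using the urn counts of steps 2 and 4:

import random

def estimate(urn, n=10000):
    # Estimate each suspect's probability by repeated draws from the urn.
    draws = [random.choice(urn) for _ in range(n)]
    return {s: round(draws.count(s) / n, 3) for s in "abcd"}

# Step 2: one ball per suspect plus 3 a's and 3 b's, so p(a or b) = 8/10 = .8.
after_handedness = ["a"] * 4 + ["b"] * 4 + ["c"] + ["d"]
print(estimate(after_handedness))    # about .4, .4, .1, .1

# Step 4: rebuild the urn so p(a or c) = .7, keeping a:c and b:d unchanged.
after_sex = ["a"] * 56 + ["b"] * 24 + ["c"] * 14 + ["d"] * 6
print(estimate(after_sex))           # about .56, .24, .14, .06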
Again the simulation procedure suffices quite well without auxiliary logical rules. I have not yet found any reason to think that a similar procedure would not operate successfully in other numerical (i.e. realistic) cases.

These two examples suggest that simulation can provide both an easy solution and considerable insight into the nature of at least some problems hitherto addressed with Jeffrey's Rule. Whether this is true of all such problems, or whether Jeffrey's Rule handles problems that simulation cannot, or whether Jeffrey's Rule provides insight over and beyond what simulation provides in some or most cases, is at present unknown. An answer would require canvassing and analyzing many such problems, in a variety of contexts.

2. A Non-Problem of Lewis Carroll

This is Lewis Carroll's Pillow Problem 41 (1895/1958, pp. 9, 62, 63):

My friend brings me a bag containing four counters, each of which is either black or white. He bids me draw two, both of which prove to be white. He then says "I meant to tell you, before you began, that there was at least one white counter in the bag. However, you know it now, without my telling you. Draw again." (1) What is now my chance of drawing white? (2) What would it have been if he had not spoken?

Carroll gives the following solution:

(1) As there was certainly at least one W in the bag at first, the 'a priori' chances for the various states of the bag, 'WWWW, WWWB, WWBB, WBBB,' were '1/8, 3/8, 3/8, 1/8'. These would have given, to the observed event, the chances '1, 1/2, 1/6, 0'. Hence the chances, after the event, for the various states, are proportional to '1/8.1, 3/8.1/2, 3/8.1/6'; i.e. to '1/8, 3/16, 1/16'; i.e. to '2, 3, 1'. Hence their actual values are '1/3, 1/2, 1/6'. Hence the chance, of now drawing W, is '1/3.1 + 1/2.1/2'; i.e. it is 7/12. Q.E.F.

(2) If he had not spoken, the 'a priori' chances for the states 'WWWW, WWWB, WWBB, WBBB, BBBB' would have been '1/16, 4/16, 6/16, 4/16, 1/16'. These would have given, to the observed event, the chances '1, 1/2, 1/6, 0, 0'. Hence the chances, after the event, for the various states, are proportional to '1/16.1, 4/16.1/2, 6/16.1/6'; i.e. to '1, 2, 1'. Hence their actual values are '1/4, 1/2, 1/4'. Hence the chance, of now drawing W, is '1/4.1 + 1/2.1/2'; i.e. it is 1/2. Q.E.F.

Let us consider how one would physically simulate this problem. You begin with two white balls in your hand - the ones you have already selected. Then you assume that each of the other two balls is either white (W) or black (B). To correspond with these facts and this assumption, one could make up one bag with WW, another with BB, a third with WB, and a fourth with BW. On any given trial one would a) select one of those bags at random, b) combine the two white balls in hand with the balls in the bag, and c) draw a ball. The process is so simple that we can confidently forego actual simulation and deduce that the probability of a white is .25 * 1 (from the WWWW bag), plus .25 * .5 (from WWBB), plus .5 * 3/4 (from WWWB or WWBW), or 6/8.
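Though the deduction makes actual simulation unnecessary here, the physical procedure is easily rendered in Python (a sketch):

import random

# The four equally likely bags for the two unseen counters.
bags = [["W", "W"], ["B", "B"], ["W", "B"], ["B", "W"]]

N = 10000
whites = 0
for _ in range(N):
    counters = ["W", "W"] + random.choice(bags)   # two whites in hand plus the bag
    if random.choice(counters) == "W":
        whites += 1
print(whites / N)                                 # about 6/8 = .75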
This is a different answer than Carroll obtained. But it seems to be the answer that fits any concrete realization of the facts of the situation. If one now considers the second part of Carroll's question, the answer is quite the same as for the first part, because the actual facts - including one's state of knowledge - are the same in both cases.

In his study of the probabilistic Pillow Problems, Seneta (1984) concurs that Carroll may not have arrived at the correct answer, saying that "Dodgson [Carroll] may have had some difficulty in handling conditional probabilities" (1984, p. 88).

The important question here is: Why did Carroll arrive at a different answer than the one arrived at above? I suggest that the answer is that his purely-deductive calculations allowed him to depart from the physical facts. Over-abstraction often has this pernicious property. Simulation can often save one from falling into such error.

CONCLUSION

Bayesian problems of updating estimates can be handled easily and straightforwardly with simulation, whether the data are discrete or continuous. The process and the results tend to be intuitive and transparent. Simulation works best with the original raw data rather than with abstractions from them via percentages and distributions. This can aid the understanding as well as facilitate computation.

REFERENCES

Box, George E. P., and George C. Tiao, Bayesian Inference in Statistical Analysis (Reading, Mass.: Addison-Wesley, 1973).

Carroll, Lewis, Pillow Problems (New York: Dover, 1895/1958).

Diaconis, Persi, and Sandy L. Zabell, "Updating Subjective Probability," Journal of the American Statistical Association, vol. 77 (1982), pp. 822-830; reprinted in Decision Making, edited by David E. Bell, Howard Raiffa, and Amos Tversky (Cambridge: Cambridge University Press, 1988), pp. 266-283.

Feller, William, An Introduction to Probability Theory and Its Applications, 3rd edition (New York: Wiley, 1968).

Fisher, R. A., Statistical Methods and Scientific Inference, 2nd edition (London: Oliver and Boyd, 1959).

Gardner, Martin, The Second Scientific American Book of Mathematical Puzzles & Diversions (New York: Simon and Schuster, 1961).

Huff, Darrell, How to Take a Chance (New York: W. W. Norton, 1959).

Jeffrey, Richard, The Logic of Decision (New York: McGraw-Hill, 1965), referred to by Diaconis and Zabell.

Jeffrey, Richard, "Probable Knowledge," in The Problem of Inductive Logic, edited by I. Lakatos (Amsterdam: North-Holland, 1968), pp. 166-180, referred to by Diaconis and Zabell.

Kahneman, Daniel, and Amos Tversky, "Subjective Probability: A Judgment of Representativeness," abbreviated version of a paper originally appearing in Cognitive Psychology, vol. 3 (1972), pp. 430-454; reprinted in Judgment under Uncertainty: Heuristics and Biases, edited by Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge: Cambridge University Press, 1982), pp. 32-47.

Nisbett, Richard E., David H. Krantz, Christopher Jepson, and Geoffrey T. Fong, "Improving Inductive Inference," in Judgment under Uncertainty: Heuristics and Biases, edited by Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge: Cambridge University Press, 1982), pp. 445-459.

Piattelli-Palmarini, Massimo, Inevitable Illusions (New York: Wiley, 1994).

Schlaifer, Robert, Introduction to Statistics for Business Decisions (New York: McGraw-Hill, 1961).

Seneta, Eugene, "Lewis Carroll as a Probabilist and Mathematician," Mathematical Scientist, vol. 9 (1984), pp. 79-94.

Simon, Julian L., "What Does the Normal Curve 'Mean'?," Journal of Educational Research, vol. 61 (July-August 1968), pp. 435-438.

Simon, Julian L., "What Some Puzzling Problems Teach About the Theory of Simulation and the Use of Resampling," The American Statistician, vol. 48, no. 4 (November 1994), pp. 1-4.

Tversky, Amos, and Daniel Kahneman, "Evidential Impact of Base Rates," in Judgment under Uncertainty: Heuristics and Biases, edited by Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge: Cambridge University Press, 1982), pp. 153-162.

Whitworth, W. A., Choice and Chance, 5th edition (Cambridge: Deighton Bell, 1901).

ENDNOTES

1. Darrell Huff provides the quote but without reference: "This branch of mathematics [probability] is the only one, I believe, in which good writers frequently get results entirely erroneous" (Huff, 1959, frontispiece).

2. We can use a similar procedure to illustrate an aspect of the Bayesian procedure that Box and Tiao emphasize, its sequentially-consistent character. First let us carry out the above procedure, but with only three black offspring in a row being observed; the program to be used is the same except for the insertion of "3" where "7" appears. We then estimate the probability of the test mouse being Bb, which turns out to be about 1/5 instead of about 1/65. We then substitute for the "test" urn a new urn with appropriate numbers of black Bb's and black BB's, to represent the "updated" prior probability. We may then continue by substituting "4" for "3" above (to attain a total of seven observed black offspring), and find that the probability is about what it was when we observed 7 black offspring in a single sample (1/65). This shows that the Bayesian procedure accumulates information without "leakage" and with consistency.

APPENDIX

Program for Fisher's mice problem [by Peter Bruce]:

NUMBERS (1 2 2) test      the urn with test mice, 1=BB and 2=Bb
NUMBERS (3) bb            urn with "bb" mice

At this point Peter Bruce - who wrote the program - resorts to a trick of adding the results of the two different sampling operations to identify particular types. This enables him to avoid some further programming with IF loops.
I worry about confusing the reader with this trick, but I can afford to be pure about it because he is doing the work and not I.

COPY 0 n                  "n" will be the number of simulations
WHILE n < 1000            repeat the following as long as n has not reached 1000
SAMPLE 1 test test*       sample a test mouse
SAMPLE 1 bb bb*           sample a brown mouse
REPEAT 7                  simulate 7 "matings"
ADD test* bb* c           "mate" the mice; if "c" is a "4", it's a BB-bb mating,
                          which always yields black offspring; if "c" is a "5",
                          it's a Bb-bb mating, which produces black offspring
                          half the time and brown the other half. In the latter
                          case we need a further simulation to determine the
                          color; we let 111 represent a black outcome, 222 a
                          brown outcome
IF c = 5                  if we have a Bb-bb mating
URN 1#111 1#222 d         offspring is 50/50 black/brown
SAMPLE 1 d e              "e" will be either 111 or 222
END
IF c = 4                  if we have a BB-bb mating
COPY 111 e                "e" must be 111
END
SCORE e y                 keep track of each of the 7 births
END                       end the "mating" loop
SUM y yy
IF yy = 777               if all seven births were black
SCORE test* z             score the genetic character of the test mouse
END
CLEAR y                   wipe out the birth scoreboard in preparation for a new trial
SIZE z n                  determine how many simulations have been run, so we
                          can stop at 1000 (see top)
END                       we proceed past here once the WHILE condition is no
                          longer satisfied
COUNT z =1 k              how often was the test mouse a BB?
DIVIDE k n kk
PRINT kk

kk = .987

Program for Bertrand's problem (Spanish treasure fleet), using the language RESAMPLING STATS [by Peter Bruce]:

NUMBERS (7 7) gg          the 3 boxes, where "7"=gold and "8"=silver
NUMBERS (7 8) gs
NUMBERS (8 8) ss
REPEAT 1000
GENERATE 1 1,3 a          select a box, where gg=1, gs=2, ss=3
IF a = 1                  if a=1, we're in the "gold-gold" box: we've picked a
                          gold, and we're guaranteed another gold (7) on our
                          second pick, so we score a "1" for success
SCORE 1 z
END
IF a = 2                  if a=2, we're in the gold-silver box
SAMPLE 1 gs b             select a coin
IF b = 7                  if b=7, we got the gold, so score 0 (no success)
                          because we can't get a 7 again; if b=8, we got the
                          silver on our first draw, and we're not interested
                          in the trial unless we get a gold first
SCORE 0 z
END
END
END                       note: if a=3, we're not interested either - we can't
                          draw a gold on our first draw
SIZE z k1                 how many times did we get an initial gold?
COUNT z =1 k2             of those times, how often was our second draw a gold?
DIVIDE k2 k1 result       calculate the latter as a proportion of the former

result = 0.64797