WHY THE FORMAL METHOD IN STATISTICS IS USUALLY THEORETICALLY INFERIOR

Julian L. Simon

You are standing in the warehouse of a playing-card factory that has been hit by a tornado. Cards are scattered everywhere, some not yet wrapped and others ripped out of their packages. The factory makes a variety of decks - for poker without a joker, poker with a joker, and pinochle; magician's decks; decks made of paper and others of plastic; cards of various sizes; and so on. Two hours from now a friend will join you for a game of near-poker with these cards. Each hand will be chosen as randomly as possible from the huge heap of cards, and then burned. What odds should you attach to getting the combination two-of-a-kind - two cards of different or the same suit but of the same number or picture - in a five-card draw?

Ask this question of a professional probabilist or statistician, and - based on the small sample I have taken - s/he is likely to say "I don't have enough information". There is even a name for this sort of question: problems lacking structure. Ask the same question of a class of high-school students or college freshmen and you will quickly get the suggestion, "Draw hands from the card pile the same way you will draw them when you play later, and see how often you get two-of-a-kind".

Who produces the better (more useful) reply - the "naive" students, or the learned statistician/probabilist? (If the question had been framed as the probability of getting [say] the jack of spades in a poker hand drawn from the pile, the probabilist probably would think of suggesting a sample. Apparently it is the combination of elements that leads the trained person to say that the job cannot be done.) This case reminds one of the three-door problem, in which resampling immediately produces the correct answer whereas trained intellects almost uniformly arrive at the wrong answer.

The untutored person's try-it procedure is, in this case, not only as good as any procedure can be, but better than any formal procedure can be, even in principle. One reason is that the probability of any given hand in the warehouse is affected by the physical properties of the cards - their sizes and materials. The various cards are not perfectly alike, just as a die cannot be perfectly true; even a bit of purposeful shaving of a die's edge can affect the odds enough to enable a gambler to cheat successfully. But an empirical estimation with an actual sample-and-deal procedure includes the effect of these physical influences, whereas any more abstract approach has great difficulty doing so.

Another issue: You might also want to estimate the chance of a three-of-a-kind hand. You quickly recognize that this event does not happen very often, and it will take many hands to estimate its probability. So you consider this procedure: take a sample of (say) 1000 cards, record their values, transform those values to a form that a computer can read, then program the computer to choose (with replacement, now) five cards at random from the 1000, and examine many trial hands (say 10,000) to see whether there are three-of-a-kind. The computer procedure should be as close an analog as possible to physically shuffling and dealing five-card hands from the 1000 sampled cards. Please notice that one need never know how many of each type (that is, face value) of card the sample contains. Rather, as each of the cards is examined, its value is transmitted to the computer.
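A minimal sketch of this computer procedure, written in Python, may make the description concrete. It is not code from the original article; the list sampled_values merely stands in for the 1000 recorded card values (fabricated here for illustration), and has_three_of_a_kind is an illustrative helper name.

    import random
    from collections import Counter

    # Hypothetical stand-in for the values of the 1000 cards actually sampled
    # from the warehouse pile; here a pile is simply fabricated for illustration.
    FACES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10",
             "J", "Q", "K", "Joker"]
    sampled_values = [random.choice(FACES) for _ in range(1000)]

    def has_three_of_a_kind(hand):
        # True if at least three cards in the hand share the same face value.
        return max(Counter(hand).values()) >= 3

    trials = 10_000
    successes = 0
    for _ in range(trials):
        # Deal five cards at random, with replacement, from the 1000 recorded
        # values - the computer analog of shuffling and dealing from the sample.
        hand = [random.choice(sampled_values) for _ in range(5)]
        if has_three_of_a_kind(hand):
            successes += 1

    print("Estimated chance of three-of-a-kind:", successes / trials)

Notice that the sketch never tallies how many cards of each face value the sample contains; it only deals hands and inspects them.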
It is unnecessary to calculate any sample space or any partition of it; one never needs to know that there are 2,598,960 or whatever number of possible poker hands. (Goldberg, 1960, p. 305)

A probabilist might suggest computing the chance of three-of-a-kind from the same 1000 pieces of information by using probability theory. Both these procedures will arrive at much the same result. Both fail to take account of physical factors - size, and type of material - that might affect physical trials with the 1000 sampled cards. The simulation will be slightly less "exact" than the theoretical calculation, the lesser exactness being made as small as desired by increasing the number of computer trials; the loss of accuracy surely will be very small relative to the sampling error deriving from choosing the 1000 cards from the huge pile - including both the random-sampling error and the bias due to not drawing the sample randomly. And of course the formal calculation in this case will be quite tricky and prone to error. It must assess the size of the sample space of three-of-a-kind hands when the numbers of cards of the various values differ, both because the factory makes different numbers (no jacks, queens, and kings in some decks, for example) and because of the inaccuracy due to the sample of only 1000 cards. In contrast, the sample space need never be known for physical or computer resampling.

EXPLANATION OF THE ADVANTAGE OF RESAMPLING

Lighter Conceptual Burden

In general, the conceptual burden in resampling is much slighter than in probability theory; this is one of resampling's main advantages. One does not need to be able to add or even to count in order to conduct individual experimental trials. One only needs to know the concept of counting, and also the concept of a ratio, so as to (first) keep a record of the numbers of successful and unsuccessful trials, and (second) add to get the total number of trials and divide to get the ratio of successful trials to total trials. Certainly the discipline that applauds the likes of Peano, Russell, and Whitehead for boiling down mathematics to its most fundamental elements should have some appreciation for an intellectual method that gets along so successfully with so little recourse to higher abstractions.

Consider, for example, the case of the probabilities of various numbers of points when throwing two dice (refer to Goldberg, 1960, p. 158ff). When specifying the sample space, etc., one needs to add the two top faces of the dice to determine the range of the function. With simulation it is never necessary to determine this range; one simply tosses the two dice and inspects the outcomes. One can ask the probability of getting "13" (or any other number) and get an answer experimentally without knowing the range in advance, as the short sketch below illustrates.
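A minimal computer rendering of this dice experiment - again a sketch of my own in Python rather than anything from the original text, assuming ordinary six-sided dice:

    import random

    def toss_two_dice():
        # Toss two ordinary dice and report the sum of the top faces.
        return random.randint(1, 6) + random.randint(1, 6)

    trials = 10_000
    target = 13  # ask about any total at all, even one that can never occur
    hits = sum(1 for _ in range(trials) if toss_two_dice() == target)

    print("Estimated chance of a total of", target, ":", hits / trials)

Asking about a total of 13 simply yields an estimate of zero; the range of possible sums never has to be specified in advance.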
Reducing the Extent of Abstraction from Actual Experience

Robert Shannon, in a book on Systems Simulation, constructs a continuum from "Physical models" to "Scaled models" to "Analogy models" to "Computer simulation" to "Mathematical models" (1975, p. 8). (I would add experimentation with the actual material of interest as a stage even less abstract than physical models.) At each successive stage of translation to greater abstraction one runs the risk of losing some important aspect of experiential reality, and of introducing misleading assumptions and simplifications. This argues for abstracting as little as possible, and doing so only to the extent that it is necessary.

As Shannon's continuum suggests, simulation methods in statistics (with or without a computer) are less abstract than are distributional and formulaic methods, and they should be less at risk of error. This speculation jibes with the experimental evidence that people can attain more correct answers to numerical problems with resampling methods than with formulaic methods, when given equal amounts of instruction (Simon, Atkinson, and Shevokas, 1976).

Of course the optimal level of abstraction depends upon the circumstances. If one wants to estimate the probability of a given sum with four dice in order to maximize one's chance of winning with those particular dice, experimenting with those very dice is likely to be optimal; but if one wants to know the odds with four dice in other circumstances, a more abstract approach may be better. However, there are very few circumstances in which the formulaic and distributional abstractions are likely to be better than Monte Carlo methods (lack of data being one such circumstance, and low probability being another).

Operationalizing the Problem

A third virtue of resampling may be stated as: If you understand the posing of the problem operationally, you automatically will obtain the correct answer. For example, consider this probability puzzle from Lewis Carroll's Pillow Problems (by way of Martin Gardner, correspondence, May, 1993):

A bag contains one counter, known to be either white or black. A white counter is put in, the bag shaken, and a counter drawn out, which proves to be white. What is now the chance of drawing a white counter?

The issue is, do I state the problem correctly in steps 1-4 below? If I do, that implies that the repetition of the process in those steps will lead to a correct answer to the problem.

1. Put a white counter (later have the computer call it "7" to avoid confusion) or a black counter (call it "8") in the bag, each with probability .5.
2. Put in a white counter and shuffle.
3. Take out a counter. If black, stop.
4. (If the result of (3) is white): Take out the remaining counter, examine it, and record its color.
5. Repeat steps 1-4 (say) 1000 times.
6. Count how many trials yielded a white counter first.
7. Count the number of whites ("7s") among the remaining counters in the "white first" trials, and compute their proportion.
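These seven steps translate almost line for line into a short program. The following Python sketch is my rendering, not code from the article; it keeps the article's labels of "7" for white and "8" for black.

    import random

    WHITE, BLACK = "7", "8"  # the labels used in the steps above

    white_first = 0       # trials in which the counter drawn out was white
    white_remaining = 0   # of those, trials in which the remaining counter was white

    for _ in range(1000):
        # Step 1: the unknown counter is white or black with probability .5 each.
        bag = [random.choice([WHITE, BLACK])]
        # Step 2: put in a white counter and shuffle.
        bag.append(WHITE)
        random.shuffle(bag)
        # Step 3: take out a counter; if it is black, the trial stops here.
        if bag.pop() == BLACK:
            continue
        # Step 4: the drawn counter was white - examine the one still in the bag.
        white_first += 1
        if bag[0] == WHITE:
            white_remaining += 1

    # Steps 6-7: the conditional proportion answers Carroll's question.
    print("Chance the remaining counter is white:", white_remaining / white_first)

Over many trials the printed proportion settles near two-thirds.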
The benefits of the operationalization of problems that occurs with simulation can be seen in a different way in another problem of Lewis Carroll's:

Given that there are 2 counters in a bag, as to which all that was originally known was that each was either white or black. Also given that the experiment has been tried a certain number of times, of drawing a counter, looking at it, and replacing it; that it has been white every time...What would then be the chance of drawing white? (p. 15)

This problem was an eye-opening experience for me. First I wrote down a set of steps to handle the problem with white and black balls ("counters"). But I did not actually execute the procedure. Instead, while I was waiting for an associate to write a computer program to solve the problem, following the steps I had outlined, I set out to explain the problem logically. I wrote five nice pages of what I thought to be clear explanation. A few days later I reread the steps I had written down. But now I found that I could not understand the logic. This experience shows how easy it is to get confused with Bayesian problems of this sort if one works analytically rather than with simulation. So I tried harder to create a simulation - and harder - and harder. And then I found that I simply could not create a simulation that would model the problem as Carroll wrote it (and as I understood it). Apparently I was as confused as anyone could be.

What to do? I decided to go back to my very basic principle: There must be a way to physically model every meaningful question in probability and statistics. If one cannot find a way to model the problem with a simulation, maybe there is something wrong with the problem rather than with the modeling. And indeed, when we examine it closely, we may see that Carroll's problem is not operational and hence not meaningful.

The difficulty turns out to lie in Carroll's phrase "given that the experiment has been tried a certain number of times, of drawing a counter, looking at it, and replacing it; that it has been white every time". In Carroll's solution he indicates that he believes that it is possible to infer a probability for the next trial on the basis of a series of trials that are all successes. This is a famous formula in probability theory - Laplace's rule of succession, by which the probability is (n+1)/(n+2), where n is the number of observed successes. But probability theorists such as Feller have argued (correctly, in my view) that this formula is not meaningful. And the fact that it is not possible to model the formula meaningfully in this context confirms that theoretical analysis. So once again the act of attempting to create an operational simulation of a problem and then actually executing the procedure has kept our feet on solid ground and off the slippery slope into confusion or meaninglessness.

LIMITS OF THE RESAMPLING METHOD

Low Probabilities

Can the formal method be better in any respect? Yes, it can. If you want to estimate the chance of a royal flush in poker, which probably would happen only once in hundreds of thousands or millions of trial hands, taking samples by sitting on the floor of the warehouse for a few hours and dealing hands will not produce a sound estimate. And even computer sampling might be much less accurate than analysis unless an inordinate amount of computer time is devoted to the problem.

But will the formal method surely be better for the royal flush? No. There is an excellent chance that anyone except a very skilled probabilist will use the wrong calculating formula, and the erroneous answer might well be worse than no answer at all, and worse than computer sampling or perhaps even sampling by hand. This realistic possibility of conceptual analytic error cannot be ignored in any practical situation. It is as much a source of possible error as the sampling procedure, the physical characteristics of the cards, and unsound computer programming if a computer is used. Just as with the calculation of the probability of a disaster at a nuclear reactor, each possible source of trouble must be gauged and allowed for in proportion to its likely importance. None can be dismissed as being avoidable "in principle" by proper handling.

Small Samples

Imagine a sample of the heights of four persons. You wish to estimate a confidence interval for the population mean or median. It is rather obvious that the interval should go beyond the range of the four observations, but a resampling procedure will never give that result. Does this mean that resampling is inferior here to the conventional method using (say) the t test? Implicit in the conventional method is an assumption about the shape of the distribution. Making this assumption is in no way different in principle from a Bayesian prior. And the nature of the assumption is crucial: an assumption that would be appropriate for heights would not be appropriate for incomes.

Once we have established that it is necessary to bring outside information and judgment to bear, we can then consider doing so with the resampling method as well as the conventional method. We need not enter into technical details here, but there are many possible ways to fit the observations to a distribution of any assumed shape so as to estimate its dispersion, and then to draw samples from that distribution to estimate confidence limits. This would not seem inferior to the conventional method. And if one made the assumption of a peculiar type of distribution, the advantage would seem to be with the resampling method, though this subject needs more exploration.
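Purely as an illustration of one such possibility - a sketch of my own in Python, under an assumed normal shape and with made-up data, not a procedure given in the article - the outside assumption can be folded into a resampling estimate as follows:

    import random
    import statistics

    # Four hypothetical height observations (centimeters); illustrative data only.
    heights = [162.0, 171.5, 168.0, 180.5]

    # The outside judgment: assume a normal shape for heights, with its center
    # and spread estimated from the four observations.
    mu = statistics.mean(heights)
    sigma = statistics.stdev(heights)

    resampled_means = []
    for _ in range(10_000):
        # Draw a fresh sample of four from the assumed distribution.
        sample = [random.gauss(mu, sigma) for _ in range(4)]
        resampled_means.append(statistics.mean(sample))

    resampled_means.sort()
    lower = resampled_means[249]    # roughly the 2.5th percentile of 10,000 means
    upper = resampled_means[9749]   # roughly the 97.5th percentile
    # Note: this simple sketch does not allow for the uncertainty in sigma itself,
    # which a fuller treatment would also fold in.
    print("Approximate 95% interval for the mean: %.1f to %.1f" % (lower, upper))

Because the resampled values come from the fitted distribution rather than from the four observations themselves, the procedure is no longer confined to the observed values.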
WHAT ABOUT "USUALLY"?

The title of this article says that the formal method is "usually" inferior. This assertion assumes that most applications of probability and statistics deal with situations and probabilities that lend themselves well to direct physical sampling and/or to the resampling procedure on the computer. This very general assertion, of course, might be refuted by systematically gathered evidence. What is most important, however, is not the general assertion but rather choosing the method that is right for each particular situation.

The card-warehouse example lacks realism. But estimating the probability that there will be two faults in a particular piece of machine output, where the faults appear to occur independently of one another, is not very dissimilar, though the probability model is rather different. And a quite analogous realistic set of problems was the basis for Galileo's, and then Pascal's and Fermat's, foundational work with dice games in formal probability theory, which proceeded by assessing the sample space and partitions of it. But experimentally estimating the odds, as gamblers previously had done, led to sounder answers than even such great minds as Gottfried Leibniz arrived at with deductive methods (cited by Hacking, 1975, p. 52).

Why argue that formal methods are often inferior in principle? One of the objections to resampling in statistics is that it is "only" an imperfect substitute for formal methods, and that the passage to formal methods represents an advance over simulation methods. For example, when William Kruskal compared the early statement of resampling methods in the stark terms of the necessary operational procedures with later developments in the literature, he dismissed the importance and value of the former by saying that the latter embody "real mathematics" (personal correspondence, 1984).

There is an important analogy between the lack of exactness in resampling and the movement in modern physics and mathematics, since Poincaré and Bohr, away from Newtonian deterministic analysis of closed systems and toward non-deterministic analysis of open systems. (See Ekeland, 1988, for an illuminating discussion of this movement.) Probability theory is a set of exact closed-form replicas of inexact open physical situations, of which the card warehouse is an example. (A sample of 1000 cards taken from the warehouse and then converted into equally-weighted entities turns the open system into a closed one.)
That is, when calculating the probability of two-of-a-kind in a poker hand, the sample space and the partition containing that subset are exact numbers, even though in any actual situation there are incalculable elements such as the different weights of the cards due to the different amounts of ink on them, their slightly different sizes, and so on. I am not criticizing the exact model for being an imperfect replica of an inexact reality, any more than a photograph should be criticized for not being a perfect replica of the scene it portrays. But to claim that the photograph is a truer form than the scene itself, or to claim that probability theory is more exact than a physical manipulation which is the very subject of interest - that is, to claim that the calculation of getting a pair of "2s" with two given dice is more exact than a million throws of those same two dice - is hardly supportable.

The probabilist will reply that the calculation does not refer to a particular pair of dice. But the scientist and the decision maker are always interested in some particular physical reality - a given comet, or the price of corn tomorrow - and if probability theory is to be judged by other than an esthetic test, it must be judged on its helpfulness in these particular situations. In contrast, resampling - especially physical experimentation with the elements that constitute the situation to be estimated - is inescapably inexact. It is ironic that it is criticized for that mirroring of reality.

REFERENCES

Ekeland, Ivar, Mathematics and the Unexpected (Chicago: University of Chicago Press, 1988).

Feller, William, An Introduction to Probability Theory and Its Applications (New York: Wiley, 1950).

Goldberg, Samuel, Probability - An Introduction (New York: Dover Publications, Inc., 1960).

Hacking, Ian, The Emergence of Probability (New York: Cambridge University Press, 1975), pp. 166-171.

Shannon, Robert, Systems Simulation (Englewood Cliffs: Prentice-Hall, 1975).