CHAPTER II-5
THE PLACE AND ROLE OF BAYES' RULE AND BAYESIAN ANALYSIS
Heated argument has raged about Bayes' rule for two
centuries, involving fundamental questions about the
possibilities and practices of statistics such as the disputed
concept of inverse probability. This note attempts to bring some
reconciliation and clarity to the disputes.
No statistician doubts that Bayes' formula is the
appropriate procedure for a variety of actual decision-making
situations, including (1) business decisions such as whether to
buy a used car in light of a judgment of the car's mechanical
soundness by a skilled mechanic, and (2) analysis of medical
screening and clinical decisions where there are both an a priori
probability of a disease and a test for the individual patient.
Disputes concern (1) the ability of Bayesian thinking to
establish the cause of a set of observations (induction), and
(2) the role of judgment in assessing the validity of
scientific hypotheses; these two latter issues are almost
completely separate from each other and from the decision-making
uses of Bayes' rule.
The argument centers on whether Bayes' rule should be used
mainly for such problems as those mentioned above, or whether it
should underlie all statistical work. Savage calls for its use
in all cases. He writes:
[O]ne of the most striking symptoms of the inadequacy of
statistical theory without subjective probability is the
lack of unity that such theory has had (1962, p.
13)...Fisher does not present us with a unified method
(p. 14)...It has sometimes been contended that there are
two different kinds of statistical theory, one
appropriate to economic contexts and another to pure
science; see for example, Fisher. In my own opinion,
this dualistic view is incorrect (p. 15).
Savage arrives at this judgment apparently on the grounds that
such consistency is inherently better than the piecemeal practice
of using one approach for general scientific work and another for
specific decision situations; he does not, however, give a
utilitarian argument for this judgment, despite his own emphasis
on bringing expected utility into decisions.[1]
Bayes' rule stands at the nexus of probability theory and
statistics, and points both ways. This explains why it has been
such a challenging idea for so long.
I will start by discussing the basic mechanism of Bayes'
rule as an intellectual tool. Then we will pass to the difficult
issues.
BAYES' RULE AND COMPLEX PROBLEMS IN CONDITIONAL PROBABILITY
To leading mathematicians writing about probability theory,
Bayes' rule is just one particular mode of writing and using the
idea of conditional probability. (See, e.g., Feller, 1950;
Goldberg, 1960.) Indeed, Feller directs people away from using
Bayes' formula and toward the simple formula for conditional
probability
p(A|B) = p(A,B)/p(B).
"The beginner is advised always to [use the simple formula] and
not to memorize [Bayes'] formula". He goes on to say that Bayes'
formula "is a special way of writing [the simple conditional
probability formula] and nothing more" (1950, p. 85). Feller is
quite wary of Bayes' rule, perhaps because of what he considers
unwarranted uses to which it has been applied, as we will see
later.
All but the simplest problems in conditional probability are
confusing to the intuition even if not difficult mathematically.
Indeed, a large proportion of mathematical-probabilistic puzzles
are of this nature. To make clear the nature of the rule, I
shall start with the simplest sort of problem that we can handle
with first principles, and proceed gradually from there.
Consider the probability of dealing the ace of spades from a
poker deck of 52 cards. Clearly this "unconditional probability"
is 1/52, deduced from our full knowledge of the mechanism that
may be described by the sample space. We assume away the fact
that there may be slight differences in probabilities for
different cards due to different weights, and so on - an
assumption which looms small here, but which is logically crucial
in enabling us to create the closed system necessary for the
application of probabilistic reasoning.
Now assume that we seek the probability of dealing an ace if
the ace of hearts has already been dealt. The deck now contains
51 cards, and clearly the probability is 3/51; we are deducing
what is a conditional probability from our full knowledge of the
sample space. All would agree that this is a problem in pure
probability rather than statistical inference even though the
estimate depends upon sample data - the observation of the ace of
hearts - and hence the estimate may be considered an a posteriori
probability.
Now consider this question: You make a deck of one ace and
one king, shuffle, deal a card, record whether ace or king, then
replace and repeat. What is the probability that if one of the
cards you pick is an ace, the other is an ace as well? I suggest
you write down your answer before proceeding further.
This problem was published in the Sunday newspaper
supplement Parade in the following guise:
A shopkeeper says she has two new baby beagles to
show you, but she doesn't know whether they're male,
female, or a pair. You tell her that you want only a
male, and she telephones the fellow who's giving them a
bath. "Is at least one a male?" she asks him. "Yes!"
she informs you with a smile. What is the probability
that the other one is [also] a male? (vos Savant,)
The Parade columnist gave the answer as 1 in 3. The problem is
so confusing that PhDs wrote her to say - with great confidence -
that she was all wrong.
The following simulation not only shows how to handle the
problem expeditiously, but also shows why this sort of problem is
so puzzling:
Consider a two-digit column of random numbers in Table II-5-1,
using odd numbers for females and even for males. The first
forty lines are sufficient to suggest the correct probability,
and also to make clear the mechanism: Two-female pairs, a fourth
of the cases, are excluded from the sample. And mixed pairs -
which give a "no" answer - are two-thirds of the remaining pairs,
whereas the only pairs that give "yes" answers - two males - are
only a third of the remaining pairs. Simulation gets it right
very quickly and easily, whereas the deductive method of
mathematical logic results in much confusion.
Table II-5-1
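The hand simulation described above can also be run by machine.
The following is only a minimal sketch in Python, not the
procedure used to build the table; the function name
simulate_beagles, the number of trials, and the seed are
assumptions made for illustration.

```python
import random

def simulate_beagles(trials=100_000, seed=1):
    """Estimate P(both male | at least one male) by direct simulation."""
    rng = random.Random(seed)
    at_least_one_male = 0
    both_male = 0
    for _ in range(trials):
        # Each puppy is independently male or female with probability 1/2.
        pair = (rng.choice("MF"), rng.choice("MF"))
        if "M" not in pair:
            continue  # two-female pairs are excluded from the sample
        at_least_one_male += 1
        if pair == ("M", "M"):
            both_male += 1
    return both_male / at_least_one_male

print(round(simulate_beagles(), 3))  # should land close to 1/3
```

The step that trips up intuition is the "continue" line: the
"yes" answer removes the two-female pairs, and the proportion is
then taken over the pairs that remain.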
Now let us consider still another problem, a famous puzzler
that comes from Bertrand. You face three chests with two drawers
each. The drawers in chest A contain a gold coin each, one
drawer in chest B contains a gold coin and one a silver coin, and
the drawers in chest C each contain a silver coin. You choose a
chest and a drawer randomly, and your probability of finding a
gold coin is 1/2. You then find a silver coin. What is the
probability that the coin in the other drawer of that same chest
is gold? (Adapted from Schuh, 1968, p. 166).
This situation is similar to the card-drawing case above in
that we begin with a known sample space, make one observation
without replacement, and ask about the probability of another
observation. The only obvious difference is that in this case
the subsequent sample space is a bit more complex - one must
exclude chest A from the analysis because it does not contain a
drawer with a silver coin. And this problem is solved with a
process akin to Bayes' rule. (If you want to know the answer and
how to do it with simulation, see Simon, 1994. Other puzzles are
found in Chapter 00, including the famous "Monty Hall" or "three
doors" problem.)
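For readers who prefer to experiment, the chest problem can be
set up as a short simulation. This is a sketch under stated
assumptions, not the treatment in the reference cited above: the
names CHESTS and simulate_chests, the trial count, and the seed
are all hypothetical.

```python
import random

# Contents of the two drawers in each chest, as described in the text.
CHESTS = {
    "A": ("gold", "gold"),
    "B": ("gold", "silver"),
    "C": ("silver", "silver"),
}

def simulate_chests(trials=100_000, seed=1):
    """Estimate P(other drawer holds gold | opened drawer holds silver)."""
    rng = random.Random(seed)
    silver_found = 0
    other_is_gold = 0
    for _ in range(trials):
        drawers = CHESTS[rng.choice("ABC")]  # choose a chest at random
        opened = rng.randrange(2)            # choose a drawer at random
        if drawers[opened] != "silver":
            continue  # keep only trials in which a silver coin is found
        silver_found += 1
        if drawers[1 - opened] == "gold":
            other_is_gold += 1
    return other_is_gold / silver_found

print(round(simulate_chests(), 3))
```

Note that chest A never survives the "continue" test, which is
the exclusion described in the text.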
Bayesian enthusiast Arnold Zellner writes: "Note that
Bayes' inverse problem is fundamentally different from those
encountered in games of chance, for example coin flipping, in
which the probabilities of outcomes are known and the
probabilities of various outcomes must be calculated" (1987, p.
208). Contrary to Zellner's view, it seems to me that Bertrand's
problem has exactly the characteristics Zellner mentions, and
also is a quintessential Bayesian problem (as Schuh notes, pp.
176-177); yet it is a pure problem in probability.
Zellner goes on to say that gambling games are "problems in
direct probability", to which he contrasts "Bayes' inverse
probability problem, [wherein for example] five heads in six
flips of a coin are observed and what must be calculated or inferred
is the chance that the probability of a head on a single flip
lies in a given interval, say 0.4 to 0.7" (p. 208, italics in
original). But the beagles problem and Bertrand's problem both
seem to me quite in the tradition of direct probability, and are
traditionally treated in texts on probability.
A common and exceedingly valuable contemporary practical
use of Bayes' rule is probability analyses of medical screening
and diagnosis, commonly described in texts on biostatistics.
These classic applications of Bayesian analysis can be modeled
with urns in a standard probability context. For non-
mathematicians the analysis is understood most easily with Venn
(box) diagrams and probability trees; the formula is the least
revealing mode for non-mathematicians.
Let's consider for specificity an analysis of widespread
screening for a type of cancer which we shall call C (example
adapted from Wonnacott and Wonnacott). Assume C occurs in 3 of
1000 Americans. An early-warning screening test has been found
to produce the following results: In patients without the
cancer, the test shows 5 percent positive, 95 percent negative;
in patients with the cancer, the test shows 98 percent positive,
2 percent negative. The central question is: What proportion
of the positive results will indicate that the patient actually
has cancer?
Applying the formula above, we get
p (cancer given a positive test) = proportion of
population that have cancer and show a positive result
divided by the proportion of the population that show a
positive result =
(.98 * .003) / [(.98 * .003) + (.05 * .997)] =
(.0029)/(.0029 + .0499) = .056
Bayes' rule shows its value dramatically here because of the
counter-intuitive nature of the results. Most laypersons guess
that the answer is about 90 percent, and the 5.6 percent answer
astonishes them. In many other such cases the probabilities that
are calculated also are wildly different than one expects
beforehand.
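The same arithmetic can be laid out as counts of people, which is
the box-diagram view in numerical form. The sketch below assumes
a hypothetical population of 100,000; the function name
screening_counts and its parameter names are illustrative, and
only the rates come from the example above.

```python
def screening_counts(population, prevalence, sensitivity, false_positive_rate):
    """Split a population into true positives and false positives."""
    with_disease = population * prevalence
    without_disease = population - with_disease
    true_positives = with_disease * sensitivity
    false_positives = without_disease * false_positive_rate
    return true_positives, false_positives

# Rates from the text; the population size is an assumption.
true_pos, false_pos = screening_counts(100_000, 0.003, 0.98, 0.05)
share = true_pos / (true_pos + false_pos)
print(round(true_pos), round(false_pos), round(share, 3))  # 294 4985 0.056
```

Seen this way, the small answer is less astonishing: the 294
true positives are swamped by 4,985 false positives drawn from
the much larger group of people without the cancer.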
Here is an example of the diagnosis of an individual patient
(adapted from Roberts, p. 490): Based on preliminary
examination, a doctor estimates the probability to be .005 that
Patient F who suffered a blow on the head has a fractured skull.
Experience suggests that an x-ray will give positive results in
.96 of cases where there is a fracture, and .1 of the cases where
there is no fracture. If the doctor takes an x-ray and finds a
positive result, what is the chance that the patient really has a
fracture? The surprising answer is .046.
Now an example in the diagnosis of whether a used car has a
faulty transmission or not (adapted from Wonnacott and
Wonnacott, p. 93): Prior
data show that 30 percent have a fault. A mechanic labels 90
percent of faulty cars as "faulty", 10 percent as "okay"; he
labels 20 percent of the OK cars "faulty", and 80 percent "okay".
Bayesian analysis shows that the probability that the car will be
faulty if he says "okay" is .05. So with the additional informa-
tion one's surety of getting an okay car is raised from .7 to
.95.
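Both of these answers follow from the same two-hypothesis
calculation. A minimal sketch, assuming a helper named posterior
(a hypothetical name, as are its parameters); the probabilities
are those stated in the two examples.

```python
def posterior(prior, p_signal_if_true, p_signal_if_false):
    """Bayes' rule for two hypotheses: P(hypothesis | signal observed)."""
    joint_true = prior * p_signal_if_true
    joint_false = (1 - prior) * p_signal_if_false
    return joint_true / (joint_true + joint_false)

# Skull fracture, given a positive x-ray.
fracture = posterior(prior=0.005, p_signal_if_true=0.96, p_signal_if_false=0.10)

# Faulty transmission, given that the mechanic says "okay".
faulty = posterior(prior=0.30, p_signal_if_true=0.10, p_signal_if_false=0.80)

print(round(fracture, 3), round(faulty, 2))  # 0.046 0.05
```

In the car example the "signal" is the mechanic's verdict of
"okay", so the two conditional probabilities are the rates at
which he says "okay" of faulty cars and of sound cars.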
A typical business example is a decision about whether to
produce a new product, with the key factor being the demand for
the product. The firm typically has a prior rough intuitive
assessment of demand. Market research (which is costly) can
provide improved estimates. One first calculates (using Bayesian
analysis and backward induction) whether the expected benefit of
the research is worth its cost, and then if the firm goes ahead
with the research, Bayesian analysis is used to assess whether to
go ahead with the new production.
These everyday uses of Bayes' rule, and the modes of
presentation of the concept in such basic texts as those of
Feller and Goldberg, should make clear that Bayes' rule is a
standard mathematical shortcut for investigating sample spaces
and clarifying problems that challenge the intuition. This
everyday quality hopefully demystifies it and makes it seem
unlikely as a device for entering into the misty unknowns of
ascertaining causes that cannot be identified in other ways.
Elsewhere [Simon, 1996] I discuss why problems in Bayesian
analysis are so hard to master intuitively. The difficulty lies
in shifting from an analysis of the entire sample space to some
part of it - from the entire population to one or more sub-
groups. This requires a long chain of reasoning containing
several switches in intuitive direction. This may also be seen
in the examples that are worked later in this chapter.
INVERSE PROBABILITY AND THE SEARCH FOR CAUSES
The use of Bayes' rule as a device for identifying causes of
observed phenomena was the original motive for Bayes' work, and
later for Laplace's. Bayes' rule is still believed by some to be
successful in that quest. For example, Dennis Lindley says that
Bayes' "theorem must stand with Einstein's E=mc^2 as one of the
great, simple truths" (1987, p. 208), and Stephen Stigler calls
it a "truly Copernican revolution in statistical concept" (1986,
p. 122). Zellner (1987, p. 208) joins in giving Bayes "credit
for solving the famous `inverse probability problem'".
Other statisticians such as Ronald Fisher, and probabilists
such as Feller (see above) believe (and I share their view) that
Bayes' rule fails entirely in what it purports to do - ascertain
causes and assign "inverse probability," though it certainly is a
great and valuable idea for some applied purposes as discussed
above.
The underlying notion of inverse probability is that if one
can examine a mechanism that generates outcomes of some sort and
logically evaluate the probabilities of the various outcomes, it
should be possible to examine a set of outcomes and logically
assess the probabilities of various causes.
Everyone understands that as a) your sample grows larger, or
b) the dispersion grows smaller, you have less uncertainty about
specified parameters. But this does not imply that one can by
calculation identify causes among an unspecified set of possibil-
ities and assign probabilities to them of being causes.
Bayes' rule certainly seems to point to causes in some
situations. The case of medical diagnosis is particularly
clearcut: the combination of a prior probability of disease plus
the result of a test on a particular patient (the tests have a
high but not perfect record of being correct whether negative or
positive) seems to be a probabilistic identification of a cause.
But one is only choosing between two specified hypotheses -
presence or absence of the disease - which enables one to have a
closed sample space within which the problem is one of pure
"direct" probability; this is quite different than what Bayes and
Laplace had in mind, about which more later.
The basic quest for a method of doing what Bayes and Laplace
sought to do, other than in restricted situations, seems impossi-
ble. I will first offer a few brief remarks pertaining to this
conclusion, and then I will explore the matter at greater length
with examples.
Fisher rejected the concept of inverse probability - which
for him is synonymous with induction - on a variety of grounds:
1) "apparent mathematical contradictions"; 2) "Its truth should
be apparent to any rational mind", and it is not (certainly not
to Fisher's mind); and 3) it "has been only very rarely used" in
science (1966/1990, pp. 6-7). He notes that a "well-defined
distribution" is necessary a priori (1973/1990, p. 54), and when
this does not exist, it commonly is brought in by "mathematical
sleight-of-hand" (1966/1990, p. 198).
Concerning Fisher's objection in the paragraph above that
Bayes' rule is "rarely used" in science, it is relevant that such
well-known books specifically devoted to Bayesian analysis as
those of Press (1989) and Hartigan (1983) do not contain any real
or even realistic examples of the use of the analysis for the
most common sort of statistical analysis in social science and
biostatistics: hypothesis testing or confidence intervals.
Geweke (1988) labors mightily to show some actual uses in
econometrics, the field that has expressed most interest in such
use. But Geweke's examples are difficult cases rather than
the everyday sorts of tests, they require complex computer
algorithms, and they are not the sort of use of Bayes' rule that
one ordinarily imagines - a reasonably sharp prior distribution
based on prior knowledge that is then modified in light of the
sample evidence. All of this combines to make the case against
common adoption of Bayes' rule in science rather than for it.
Two other reasons why no statistical procedure can
automatically identify a cause: 1) There always remains the
possibility of a "hidden" or "third" variable which is related to
the measured variables but that is not considered explicitly in
the research. (See Simon and Burstein, 1985, for examples and
discussion of this phenomenon.) 2) The larger scientific-
statistical context, which must affect the interpretation of a
given set of data, is exceedingly difficult (and perhaps
impossible) to capture within a Bayesian or other procedure. An
important example is the number of sets of data that have been
reviewed before the particular data set in question was analyzed,
and the number of variables that were examined before seizing on
the variables presently under investigation. This is the phenome-
non known to statisticians as "data dredging". For example, if a
hundred substances chosen at random are tested for an effect on a
given disease, five may be expected to show a "significant"
effect at the "95 percent significance level", even though none
of the five will be likely to have an effect if tested again; it
is for the same reason that so many random phenomena in our lives
such as "premonitions" and coincidences seem meaningful to us.
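The data-dredging arithmetic can be checked by simulation. The
sketch below is illustrative only: the group size of 20, the
normal universe with standard deviation 1, the z-test, and the
name count_significant are all assumptions, chosen to show how
many "significant" results appear among 100 substances when none
of them has any effect.

```python
import random

def count_significant(n_substances=100, n_per_group=20, rng=None):
    """Count 'significant' results when no substance has any real effect."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_substances):
        # Treated and control patients come from the same universe,
        # a standard normal distribution, so the true effect is zero.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        # z-test with known standard deviation 1; |z| > 1.96 is the
        # conventional two-sided 5 percent cutoff.
        z = diff / (2 / n_per_group) ** 0.5
        if abs(z) > 1.96:
            hits += 1
    return hits

rng = random.Random(1)
counts = [count_significant(rng=rng) for _ in range(50)]
# Average number of spurious "discoveries" per 100 substances tested.
print(sum(counts) / len(counts))
```

The average should come out near five, as the text says, and a
fresh test of any one of those "discoveries" would again have
only a 5 percent chance of appearing significant.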
Feller points out that the manipulation of Bayes' rule has
led into many peculiar byways. He tells us that "Plato used this
type of argument to prove the existence of Atlantis, and
philosophers used it to prove the absurdity of Newton's
mechanics" (p. 85).
The operational reason for rejecting the notion of inverse
probability that I find most persuasive, however, is as follows:
One can program a computer to start with a causal device such as
a die or a pack of cards or a machine with given parameters that
produces ball-bearings - all assumptions being explicit and not
subject to argument - and it will produce the pattern of the
expected results in such fashion that all will agree that the
result is correct. And of course one can program a Bayesian
analysis of a medical diagnosis. But computer modeling of inverse
probability is not possible. One cannot begin with some pattern
of observations, and - with only computations that are made
directly from the observations - produce a set of alternative
causes of the observations and find probabilities for those
alternatives. The point is not that this cannot be done in
practice; rather, it cannot be done in principle. And if one
cannot even imagine a computer program of this sort, then any set
of formulae that purports to accomplish this goal must contain
some hidden assumptions that represent a kind of mumbo-jumbo.
The above paragraph expresses in a different way a basic
truth about statistical inference that is repeated throughout
this book: Every question that is originally framed as a ques-
tion in statistics or inverse probability eventually is answered
by manipulating some known universe and analyzing the results in
a manner which clearly is an exercise in direct probability.
<1>
Indeed, Stigler emphasizes that both Bayes and Laplace make
implicit assumptions about prior distribution that are crucial to
their analyses. But the implicit assumptions that they and
others make are often so complex that their presence and
operation are obscured.
Why, then, is it possible to use Bayes' rule to put proba-
bilities on causes in the case of medical screening? Why is
Bayes' rule useful in sorting out Bertrand's puzzle of the chests
though the reasoning seems to work backwards (or inversely) from
the observed evidence? The explanation (as suggested earlier) is
that those cases constitute closed sample spaces. Whenever one
has such a closed space - and any open sample space may be closed
with judiciously chosen assumptions - then the analysis is simply
an exercise in probability theory; one way or another, one asks
about the behavior of a specified causal mechanism, rather than
discovering the causes. This is not inverse probability but
rather direct probability.
Let us now consider in more detail, and with examples, the
place of Bayes' rule in statistical inference. The context is
the relationship of probability to testing hypotheses and
estimating reliability.
Consider the extreme case of drawing from an urn ten red
balls in a row and nothing else, or ten slips of paper on which
are written "13". Obviously the probability is high that the urn
contains only reds or only "13's". But just how high is that
probability? No one has discovered a sound way to assess this
probability in the abstract, without additional information, and
no one is likely to. The probability certainly is not 1.0 that
the urn contains only reds or "13"s, because after pulling only
one or two reds from the urn, one would certainly not assign a
probability of 1.0 to the next ball being red. But no probabili-
ty other than 1.0 makes sense, either.
One popular formula has been p = n/(n+1), where n is the
number of successful trials (assuming there have been no
unsuccessful trials). But one would not assign a probability of
.5 to a second ball being red after just one has been taken and
found red. Feller tells us that this formula has led to many
historical cases of nonsense conclusions. And the formula has no
provision to allow for theoretical understanding, which is why
the probability that the sun will come up next day would have
been lower a thousand years ago than now, an example which Feller
properly considers nonsensical.
Another example in which the formula is obviously defective:
Does the probability of your being alive the next day continue to
increase indefinitely the longer you go on living? The formula
applied in a theoretical vacuum implies that it does continue to
increase, but I doubt that any person of 80 believes this to be
true. This is one more case where a mathematical formula can be
so bewitching as to override the obvious need for judgment to
countervail the formula.
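To see how the formula behaves when applied in a theoretical
vacuum, one can simply tabulate it. A small sketch; the function
name naive_succession is hypothetical, and the last value of n
treats each day of an 80-year life as one "successful trial",
which is an assumption made only to mirror the example above.

```python
def naive_succession(n):
    """The formula from the text: p = n / (n + 1) after n successes, no failures."""
    return n / (n + 1)

# One red ball, two, ten, and then 80 years of daily survival.
for n in (1, 2, 10, 80 * 365):
    print(n, naive_succession(n))
```

The value rises toward 1.0 with every additional success and can
never fall, whatever else one knows about the situation.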
If it is not possible to put a probability on the parameter
in this case of no variation in the data (on the basis of the
data alone), why should one hope that one can do better when
there is dispersion in the results?
Consider also this related example: You are told that you
are to draw from one of two urns whose identities you do not know
- urn A with 100 red balls, and urn B with 100 black balls. You
draw and get a red ball. You will now predict with certainty
that the next ball you draw from the same urn will be red.
Furthermore, we say that you deduce the color of the next ball.
Now change the facts and have Urn A contain 99 reds and
one black, and Urn B contain 99 blacks with one red. Again you
draw from a randomly-chosen urn and obtain a red ball. You
predict red for the next ball with almost as high a probability
as when each urn had only red or black balls. And perhaps it is
proper to say that you are deducing your probability estimate,
unlike the case of an urn whose total contents you do not know.
A key difference is that in this case the sample space can be
considered closed, and we can therefore calculate a probability
automatically by formula. But this case is not an analog to the
search for causes, wherein it is never possible to specify a
closed sample space. (Indeed, if the sample space could be con-
sidered closed, there would be no place for the discovery of new
knowledge, which is the essence of science!)
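Because the sample space here is closed, the probability can be
computed exactly. The sketch below assumes that the urn is chosen
with probability 1/2 and that the second ball is drawn without
replacing the first; the text does not say which, so that is an
assumption, as is the function name prob_next_red.

```python
from fractions import Fraction

def prob_next_red(reds_a, blacks_a, reds_b, blacks_b):
    """P(second draw red | first draw red), urn chosen at random, no replacement."""
    half = Fraction(1, 2)
    p_first_red = Fraction(0)  # P(first ball is red)
    p_both_red = Fraction(0)   # P(first and second balls are both red)
    for reds, blacks in ((reds_a, blacks_a), (reds_b, blacks_b)):
        n = reds + blacks
        first = Fraction(reds, n)
        second = Fraction(reds - 1, n - 1) if reds > 0 else Fraction(0)
        p_first_red += half * first
        p_both_red += half * first * second
    return p_both_red / p_first_red

print(prob_next_red(100, 0, 0, 100))  # 1
print(prob_next_red(99, 1, 1, 99))    # 49/50
```

Every quantity the function needs is a fully specified content
of a fully specified urn, which is exactly what is missing when
one searches for causes.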
Let us next consider the special case in the testing of
hypotheses in which there are only two hypotheses, each held with
equal probability. This situation can be analyzed with the
Neyman-Pearson framework. But there does not seem to be any
conceptual difference between the Neyman-Pearson approach and the
Bayesian outlook with the two p's = .5. Indeed, the
Neyman-Pearson framework can be seen as an analysis of the
likelihoods of
the two hypothesized universes conditional upon the sample evi-
dence.
When we move to the more conventional hypothesis-testing
situation - that which John Arbuthnott analyzed in 1718 (see
Chapter 00) - we find a situation in which only the null hypothe-
sis is fully specified (p of a male = .5, for Arbuthnott).
Unlike the Neyman-Pearson case, the alternative hypothesis is not
specified in this case. True, we may say that observing the
evidence to be not consistent with the null hypothesis, and
consequently rejecting it, is tantamount to accepting the alter-
native hypothesis. But this does not mean that we are committed
to the particular parameter that was observed (or any other) as
being the "alternative" value, nor do we put a probability on an
alternative magnitude; rather, we are simply committed to con-
cluding that the parameter in which we are interested is not the
null hypothetical value. It is this absence of fully-specified
alternatives that makes it impossible to construct a closed
sample space, and that distinguishes the core of statistical
inference from problems in pure probability - at least in my
view. Of course one may refer to the Neyman-Pearson analysis as
statistical inference if one chooses, on the grounds that the
conclusion is conditional upon the sample evidence (as is the
case with the puzzles above). But using the term "statistical
inference" in this context is a matter of personal taste in
terminology, and should not obscure the fact that the grand aim
of Bayes and Laplace cannot be realized.
Consider the example of two sets of cholesterol
measurements, one set for a group that received treatment T and
the other for a group that received placebo P. One can infer
whether it is likely that the T set came from the same universe
as the P measurements by investigating the probability that the
same universe would produce two samples as different as were
observed, and one might even boldly speak about the probability
that the two came from the same universe. But if one finds that
the chance that the same universe would produce both sets is very
small, one cannot assign a probability to the proposition that
the observed sample parameter or any other parameter is the
parameter of the T universe; the result of the hypothesis test -
the probability that a single universe would produce both sets of
data - casts no light on this proposition. One can speak about
the probability (likelihood) that any particular universe would
produce the observed T data, and even state that a universe with
the sample parameter is the universe with the highest probability
(maximum likelihood). But this relative statement is not at all
the absolute probability that Bayes and Laplace sought.
Exploring the diagram in Figure II-5-1 may help one under-
stand why the original aim of Bayes and Laplace to calculate
inverse probabilities in science usually is impossible. Imagine
urns A, B, and C with 600, 300, and 100 tiny balls respectively,
corresponding to initial Markov state probabilities dj of .6, .3,
and .1. Each urn has pipes to urns W, X, Y, and Z. The transi-
tional probabilities pjk that a ball from (say) A will go through
its pipes to W, X, Y, and Z are .2, .4, .1, and .3 respectively.
Assume that the balls are marked secretly with their origins.
You examine a ball in W. What is the probability that it came
from A?
Figure II-5-1
Given the quantities of balls in A, B, and C (which
correspond to the probabilities of being in the initial Markovian
states), and the transition probabilities from A to W, X, Y, and
Z (both sets of items given above), and using Bayes' rule, one
can compute the probability sought. The explanation is far from
intuitively easy, however.
Let us first notice that we can deduce the quantities of
balls that will have the outcomes of being in W, X, Y, and Z from
the initial states and the transitional probabilities. For
example, if p(W|A) = pAW = .1, the quantity of balls going from A
to W equals .6 * .1 = .06. For each outcome urn the total is
simply the sum of the transition probabilities to it from each
initial-state urn times the quantity of balls in that urn. The
outcome-state probabilities dk are the ratios of these
quantities, summing to 1.0.
Now let us ask about the "inverse probability" (a term that
even Feller uses in this context; p. 341) that a given ball in W
came from A; call it q(A|W) or qWA, in Feller's notation. This
clearly is not the same as the probability that a ball from A
will go to W; such a probability depends only on the transition
probability p(W|A) which we can also write pAW. Nor is it the
same as the probability p(W, A) that a ball (among all the balls
in A, B, and C) will be first in A and then in W.
The inverse probability q(A|W) = qWA =
(dA/dW) * p(W|A).
From this equation we see that calculating the inverse
probability requires knowing the probabilities of the initial
states as well as the probabilities of the outcome states, or to
put it differently, the probabilities of the initial states and
all the transition probabilities (the probabilities of the outcome
states being known if we know the initial-state probabilities and
all the transition probabilities). If dW = .3, then qWA =
(.6/.3) * .1 = .2, which is different from pAW, which equals .1,
and from the joint probability p(W, A), which equals .06.
If W were larger relative to X, Y, and Z - which could only be
the case if pAW were larger than the assumed .1 (assuming A
remains the same) - then qWA would be larger. So we see that for
inverse probabilities to be calculated, the sizes of all initial
and outcome states and the transition probabilities must be known
- that is, a completely closed system.
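The calculation for urn W can be sketched in a few lines. The
initial-state probabilities and p(W|A) = .1 are taken from the
worked arithmetic above; the transition probabilities from B and
C into W are not given in the text, so the values .7 and .3 used
here are hypothetical, chosen only so that dW comes to the
assumed .3. The function name inverse_probabilities is likewise
an assumption.

```python
def inverse_probabilities(initial, to_w):
    """Return d_W and q(origin | W) for each origin urn.

    initial: probability of each initial state (urn of origin)
    to_w:    transition probability from each origin urn into W
    """
    # Joint probability of starting in each urn and ending in W.
    joint = {urn: initial[urn] * to_w[urn] for urn in initial}
    d_w = sum(joint.values())
    return d_w, {urn: joint[urn] / d_w for urn in joint}

initial = {"A": 0.6, "B": 0.3, "C": 0.1}
to_w = {"A": 0.1, "B": 0.7, "C": 0.3}  # B and C values are hypothetical

d_w, q = inverse_probabilities(initial, to_w)
print(round(d_w, 2), round(q["A"], 2))  # 0.3 0.2
```

Remove any one entry from either dictionary and the function
cannot run, which is the closed-system requirement in concrete
form.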
To repeat, without knowing the probabilities of the initial
states, we cannot calculate inverse probabilities. But the
probabilities of the initial states are seldom known in science,
as I have already discussed; science seldom deals with closed
systems except perhaps sometimes (maybe) in physics, chemistry
and genetics. Hence the dream of calculating the probability of
a given possible origin of a set of data from the data themselves
is impossible or unreasonable in most situations.
The absence of a closed sample space in an inferential
situation also leads to the difficulties in the interpretation of
the probabilistic results that constitute the hallmark of
statistical inference. Again, in the absence of specified prior
alternatives, it is not possible to put a probability on the
likelihood that the observed sample came from a particular
universe - the very aim of Bayes and Laplace which has been the
holy grail for statisticians, and a source of frustration of
beginning students of statistics. We must recognize that the
search for a way to identify causes from data alone, without the
introduction of judgments other than probability distributions,
is just a mirage - though a mirage that we may continue to pursue
because it involves some deep-seated psychological need.
(The human yearning for a calculus that will assign
probabilities to causes probably is so strong a desire that no
arguments can eradicate it completely. Fisher speaks about "the
desire felt for such probability statements" (1973/1990,
pp. 60-61).)
The argument in this section dovetails with the discussion
of causality in Chapter 00 which emphasizes that a judgment of
causality in any research situation cannot be made automatically
from the observed correlations but must be made in the light of
what is known theoretically.
Now I switch directions and argue the great importance of
Bayes' rule in statistical inference.
The reasons adduced above show convincingly, I think,
that Bayes' rule does not provide an automatic, judgment-free
device to ascertain causes or to assign "inverse probability";
such a device is impossible in principle. But Bayes' and
Laplace's work led to the corpus of our modern knowledge about
statistical inference. While we cannot do quite what Bayes and
Laplace hoped to do, statisticians have developed more modest
concepts and devices that - when used with wisely-chosen
auxiliary assumptions - provide quantitative assessments of the
reliability of parameters and enable us to reach conclusions
about whether hypotheses about sameness and difference of various
treatments are reasonable. In every case these are conclusions
built upon "direct" probabilities from ordinary probability
models though they are embedded in the structure of an analysis
which constitutes statistical inference. (Discussion of these
concepts and devices may be found in Chapters 00.) In this sense
it may be reasonable to apply terms such as "Copernican
revolution".
CONFIDENCE INTERVALS AND BAYESIAN ANALYSIS
In contrast to the search for the probabilities of causes,
Bayesian thinking can, it seems to me, often be valuable in
constructing confidence intervals. If one states one's prior
beliefs about the distribution of the parameter in question, and
then combines that distribution with the observed data, there is
nothing mysterious or ambiguous about stating the posterior
distribution of belief, which can then be considered as the stuff
of a confidence interval. Therefore, Bayesian analysis can serve
well to shine clear sunlight on this murky concept. And even if
one wishes to state an extremely "uninformative" prior
distribution - that is, a state of affairs when one asserts close
to no knowledge at all - the Bayesian procedure is admirably
clear and consistent, pulling no rabbits from a hat. An
illustration (using data from Box and Tiao) may be found in
Chapter 00.
One need not even do anything different from standard
confidence-interval calculations to get the benefit of Bayesian
analysis. One may simply interpret the results in the Bayesian
fashion so as to obtain meaningful statements.
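The following Python sketch illustrates this point under strong
simplifying assumptions: normally distributed data with a known
standard deviation, and a normal prior, with a very wide prior
standing in for an "uninformative" one. The data values and the
function names are hypothetical and are not the Box and Tiao
data.

```python
from math import sqrt
from statistics import NormalDist, mean


def classical_interval(data, sigma, level=0.95):
    """Standard confidence interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(len(data))
    center = mean(data)
    return (center - half_width, center + half_width)


def posterior_interval(data, sigma, prior_mean, prior_sd, level=0.95):
    """Central posterior interval from a normal prior and normal data."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(data) / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * mean(data))
    posterior = NormalDist(post_mean, sqrt(post_var))
    tail = (1 - level) / 2
    return (posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail))


# Hypothetical measurements; sigma is assumed known for simplicity.
data = [9.8, 10.4, 10.1, 9.6, 10.6]
sigma = 0.5

classical = classical_interval(data, sigma)
# A nearly flat prior: close to no prior knowledge at all.
flat = posterior_interval(data, sigma, prior_mean=0.0, prior_sd=1e6)
# An informative prior, shown for contrast.
informed = posterior_interval(data, sigma, prior_mean=9.0, prior_sd=0.5)

print("classical interval:       ", classical)
print("posterior, flat prior:    ", flat)
print("posterior, informed prior:", informed)
```

With the nearly flat prior the two calculations give the same
numbers for practical purposes, so in this simple case the
difference lies only in how the interval is interpreted. The
informative prior moves and narrows the interval.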
BAYESIAN JUDGMENT IN SCIENCE
This section intertwines with the comments in Chapter I-5
about Bayesian analysis and the use of judgment in statistical
inference.
It is in scientific research rather than in decision-making
analyses that the inclusion of one's prior probabilities is open
to argument, and not on grounds of the theory of statistics but
on grounds of wise scientific practice. The issue is not the
appropriateness of subjective judgment in the statistical
process; I believe that the arguments in Chapter I-5 are overwhelming
that the use of judgment is unavoidable and ever-present.
Rather, in my view the issue is whether there are some situations
where it is most wise to conduct the analysis as if you are
entirely agnostic about the likely outcome, on the grounds that
proceeding with as open a mind as possible is the most efficient
way to make scientific progress as well as to achieve fruitful
interaction among researchers working in the general field.
The assignment of prior probabilities is exactly the
introduction of theoretical knowledge the absence of which was
seen earlier to be fatal to the use of the n/(n+1) formula in the
case where the trials have all had the same outcome. And no
sensible person would argue that when a decision is made one
should not use all the relevant knowledge. But this does not
imply that theoretical knowledge should always be allowed to
affect the conclusion drawn from a particular set of empirical
data.
One economist passes along a joking anecdote about a
colleague who models microeconomic agents in his research as
employing Bayesian analysis, but who himself forswears the
technique in analyzing his results. "He models all his agents as
knowing Bayesian decision theory even though he [himself]
doesn't" (Rust, 1988, p. 146). And the writer then refers to
this as "hypocrisy" (though not of a simple kind). This supposed
inconsistency is given as an argument for the use of Bayesian
analysis by the source of the anecdote.
It is not the lack of a sense of humor, I think, that causes
me to see no inconsistency in that sort of behavior. In the case
of a person making an economic decision about the purchase of a
used car, or a physician deciding how to treat a patient or
whether to screen for a disease, Bayesian thinking may be quite
appropriate; but for the scientist modeling or analyzing exactly
the same behavior, Bayesian analysis may not be appropriate.
The common argument against the use of Bayesian priors is
that the scientist, consciously or unconsciously, may choose them
in a fashion that will bolster the desired result - and it is
most unlikely that the researcher does not have a desire to see
one result rather than another. Additionally, once one has seen
the empirical analysis apart from the Bayesian analysis, one
needs a great deal of self-discipline to ensure that the priors
are not affected by the data, no matter how conscientious
the researcher is in making "priors" prior to the analysis.
The use of ingenuity may be able to surmount these
difficulties at least in part - particularly the use of multiple
priors in a sort of sensitivity analysis, and putting the priors
into some sort of a "bank" prior to the empirical work. At the
least, there are some situations where such safeguards could make
Bayesian analysis worth doing in science.
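A sensitivity analysis of this kind might be sketched as follows
in Python. The sketch assumes a simple success/failure outcome
with Beta priors, which is only one of many possible setups; the
three "banked" priors, the data counts, and the function names
are all hypothetical.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of a proportion under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


# Priors written down ("banked") before any data are examined.
BANKED_PRIORS = {
    "skeptical": (2, 8),
    "flat": (1, 1),
    "optimistic": (8, 2),
}


def sensitivity(priors, successes, trials):
    """Posterior mean under every banked prior, for the same data."""
    return {name: posterior_mean(a, b, successes, trials)
            for name, (a, b) in priors.items()}


def spread(results):
    """Range of conclusions across the banked priors."""
    return max(results.values()) - min(results.values())


small_sample = sensitivity(BANKED_PRIORS, successes=6, trials=10)
large_sample = sensitivity(BANKED_PRIORS, successes=600, trials=1000)

print("small sample:", small_sample, "spread:", round(spread(small_sample), 4))
print("large sample:", large_sample, "spread:", round(spread(large_sample), 4))
```

When the spread across priors is wide, the reader can see that
the conclusion rests heavily on the choice of prior; when it is
narrow, the data dominate whichever prior was banked.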
Yet I believe that proceeding without explicit priors will
usually be the best practice because it separates the various
parts of the research process in a useful way. An example from
my own field may best make my argument: Consider the question of
whether population growth has a negative effect upon economic
development. Almost all empirical studies fail to find a
significant correlation between those two variables, under a wide
variety of conditions and specifications. A Bayesian might do
explicitly what many enthusiasts for population control do
implicitly - find so much theoretical reason to believe ex ante
in the causal relationship that one will finally conclude a
posteriori that a correlation exists and that it fails to show up
empirically because of hidden variables working in the other
direction or muddying the matter. But if one's theoretical
belief is so strong ex ante, why bother to do the empirical
research at all?
Indeed, the greatest value of empirical research is when it
contradicts strong beliefs based on a priori theorizing. And if
the reported results of such work are influenced by the prior
probabilities, that influence weakens the power of the empirical research
to cause reexamination of the theoretical framework.
Attempting to keep prior knowledge of earlier work, and
other sources of speculation about the results, as far as
possible from the empirical work is in the same spirit as is the
double-blind experiment in biostatistics, though there are also
significant differences between the two ideas.
The example at hand is of particular interest because some
researchers and activists have argued that the finders of a zero
correlation have an obligation to prove that their finding is not
an artifact. But this is exactly the opposite of the basic
thrust of science; the usual null hypothesis presumes that
there is no effect unless convincing evidence of an effect is
shown - because between two randomly-chosen phenomena there is
not likely to be a significant relationship. The better approach
would be to consider the empirical findings meaningful and call
for an examination of the prior beliefs, instead of letting the
prior beliefs dominate the conclusions.
The standard scientific and statistical practice of considering
no effect as the point of departure and the target for
disproof - the null hypothesis - is sound doctrine, in my view.
Society employs biomedical scientists to find treatments that
have an effect rather than treatments that do not have an effect
(though there are unusual cases where the aim is to find a
substance that will be neutral and have no effects); it is easy to
find treatments that will not work. We pay chemists to find
compounds that work, psychologists to find teaching methods that
are effective, and so on. We want this causal knowledge because
it helps us control our world. And finding this knowledge of an
effect is difficult; it is tough to find important, valid, causal
relationships, and there are relatively few of them, whereas
finding examples of nothing happening is easy. And it is
relatively easy to disprove the claim that something works, because
one knows exactly which variable to examine; in contrast, it is
difficult if not impossible to prove that there is no effect,
because there are an infinite number of possible combinations of
variables that one would have to examine.
So it is reasonable that the burden of proof is on the claim
that there is an effect, just as in Anglo-Saxon courts the burden
of proof is on the claim that the defendant is guilty, for much
the same technical reasons (though the moral situations are
different).
However, the basic presumption of a zero correlation is at
odds with Bayesian practice, as I understand it.
In some branches of statistics - perhaps preeminently,
econometrics - there is a large collection of safeguards against
spurious correlations, and against invalidly interpreting a
correlation as causal. But no one feels the need to provide
safeguards against spurious non-correlations or against
interpreting a non-correlation as non-causal.
It would seem entirely in accord with the underlying
judgmental Bayesian spirit to assert that there are some
situations where one should make the judgment that it is sound
practice to ignore prior theoretical and empirical knowledge.
It may be illuminating to contrast the Bayesian search for
"pure" scientific knowledge with the application of Bayes' rule
in a business situation. A standard textbook example mentioned
earlier is a decision to produce a new product where there is a
prior probability distribution for (say) the demand for the
product, and where market research can add information and
increase knowledge about that demand; Bayes' rule produces a
sharper posterior probability distribution than does the prior
distribution and thereby presumably improves the quality of the
decision-making. But such a decision-making situation is
concerned with only the situation at hand, whereas a scientific
study is concerned with the entire corpus of knowledge; this is
perhaps as good a way as any to distinguish between "pure" and
"applied" research.
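The business case can be sketched in Python under simplifying
assumptions: demand is described by a normal prior, and the
market research is treated as a single noisy estimate of demand
with a known error. All figures, the break-even level, and the
function name update_demand are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def update_demand(prior, survey_estimate, survey_sd):
    """Combine a normal prior for demand with one noisy survey estimate."""
    prior_precision = 1 / prior.stdev ** 2
    survey_precision = 1 / survey_sd ** 2
    post_var = 1 / (prior_precision + survey_precision)
    post_mean = post_var * (prior_precision * prior.mean
                            + survey_precision * survey_estimate)
    return NormalDist(post_mean, sqrt(post_var))


# Hypothetical figures, in units sold per year.
prior = NormalDist(10_000, 3_000)
posterior = update_demand(prior, survey_estimate=12_000, survey_sd=1_500)

break_even = 11_000
p_prior = 1 - prior.cdf(break_even)
p_posterior = 1 - posterior.cdf(break_even)

print("prior:     mean", prior.mean, "sd", round(prior.stdev))
print("posterior: mean", round(posterior.mean), "sd", round(posterior.stdev))
print("P(demand > break-even): prior", round(p_prior, 3),
      "posterior", round(p_posterior, 3))
```

In this setup the posterior standard deviation is always smaller
than the prior's, which is the sense in which the posterior is
"sharper"; the decision then turns on the probability of
exceeding the break-even level.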
CONCLUSION
The origin of Bayes' rule was in the search for causes - the
process called inverse probability. With the passage of time,
probability theory has made clear that Bayes' rule is appropriate
for situations where the sample space can be entirely specified -
that is, situations where one considers that such full
specification is warranted. But this is not the case in most
scientific research; usually there are many possible values of
the population parameter of interest. Hence the search for a
tool to automatically identify causes must necessarily fail
except in situations where the alternatives can be specified
(such as the situations Neyman-Pearson analysis deals with), and
decision-making situations in which Bayes' rule is extremely
helpful in finding counter-intuitive answers.
The most controversial contemporary issue involving Bayes'
rule concerns its applicability to inferential tasks of
hypothesis testing and the estimation of reliability by
combining theoretical knowledge with sample results. There is no
doubt that research can never be wholly objective. But whether
it is most wise to examine and analyze the data as separately as
possible from theoretical background is best thought of as a
matter of research taste, style, and wisdom. I believe that in
most cases combining the prior beliefs with the data would be unwise.
This matter is discussed in a slightly different context in
Chapter 00 on causality, wherein it is argued that auxiliary
theory is needed to distinguish a causal relationship from a
correlation which has other claims to being called causal.
Procedures for Bayesian estimation in a variety of
situations are presented in Chapter 00.
**FOOTNOTES**
[1]: Throughout this book I have wherever possible left
aside issues of valuation, partly because I think that I can get
on with much of the business at hand here without such
consideration and partly because the complexity of the subject
has increased greatly in past decades (which is much to the
good) and the issues are now at a high point of unsettledness;
perhaps this will long continue to be so, and for good reason.
For background see Bell, Raiffa, and Tversky (1988), and
especially the work there of Shafer. Shafer's point of view is
in the same spirit as is this paper, because he suggests
different procedures for different sorts of situations.
**ENDNOTES**
<1>: I would greatly enjoy debating this proposition with anyone
who chooses to dispute it.