CHAPTER II-3

POINT ESTIMATION AND CONFIDENCE INTERVALS I: THE LOGIC<1>

...we can make probability statements about X; e.g.,

Pr[mu - 1.96 sigma <= X <= mu + 1.96 sigma] = 0.95.   (1)

We could rewrite this as

Pr[X - 1.96 sigma <= mu <= X + 1.96 sigma] = 0.95   (2)

or

Pr[mu ∈ (X - 1.96 sigma, X + 1.96 sigma)] = 0.95.   (3)

Although mu may appear to be the subject of statements (2) and (3), the probability distribution referred to is that of X, as was more obvious in statement (1). If X is observed to be x, we say that we have 95% confidence that x - 1.96 sigma <= mu <= x + 1.96 sigma, or say that (x - 1.96 sigma, x + 1.96 sigma) is a 95% confidence interval for mu. No probability statement is made about the proposition

x - 1.96 sigma <= mu <= x + 1.96 sigma   (4)

involving the observed value, x, since neither x nor mu has a probability distribution. The proposition (4) will be either true or false, but we do not know which. If confidence intervals with confidence coefficient p were computed on a large number of occasions, then, in the long run, the fraction p of these confidence intervals would contain the true parameter value. (This is provided that the occasions are independent and that there is no selection of cases.)

"Confidence Intervals and Regions," Encyclopedia of Statistics, pp. 120-121.

This chapter discusses how to assess the accuracy of a point estimate of the mean, median, or other statistic of a sample. We want to know: How close is our estimate of (say) the sample mean likely to be to the population mean? It is all very well to say that on average the sample mean (or other point estimator) equals a population parameter. But what about the result of any particular sample? How accurate or inaccurate an estimate is it likely to produce?
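In the resampling spirit of this book, statement (1) in the quotation at the head of this chapter - and its re-reading as statements (2) and (3) - can be checked by brute simulation. Here is a minimal sketch in Python; the particular values of mu and sigma are arbitrary choices for illustration, not taken from the quotation:

# A check by simulation of statements (1) and (2); mu and sigma here
# are arbitrary illustrative values.
import random

mu, sigma = 100.0, 10.0
trials = 100_000
covered = 0
for _ in range(trials):
    x = random.gauss(mu, sigma)   # one observation of X
    # The event in (1), |X - mu| <= 1.96 sigma, is the same event as
    # the interval in (2) and (3) containing mu:
    if x - 1.96 * sigma <= mu <= x + 1.96 * sigma:
        covered += 1
print(covered / trials)           # comes out close to 0.95

The fraction comes out near .95 whatever mu and sigma one chooses - which is just the long-run reading of proposition (4) given in the quotation.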
Early in the history of statistical inference this question arose in the practice of astronomy (see Stigler, 1986; Hald, 1990). The accuracy of an estimate is a hard intellectual nut to crack - so hard that for hundreds of years statisticians and scientists wrestled with the problem with little success; it was not until the last century or two that much progress was made. The kernel of the problem is learning the extent of the variation in the population. But whereas the sample mean can be used straightforwardly to estimate the population mean, the extent of variation in the sample does not directly estimate the extent of the variation in the population, because the variation differs at different places in the distribution, and there is no reason to expect it to be symmetrical around the estimate or the mean.

The intellectual difficulty of confidence intervals may be one reason why they are less prominent in statistics literature and practice than are tests of hypotheses (though statisticians often favor confidence intervals). Another reason is that tests of hypotheses are more fundamental for pure science because they address the question that is at the heart of all knowledge-getting: "Should these groups be considered different or the same?" The statistical inference represented by confidence limits addresses what seems to be a secondary question in most sciences (though not in astronomy or perhaps physics): "How reliable is the estimate?" Still, confidence intervals are very important in some applied sciences such as geology - estimating the variation in grades of ores, for example - and in some parts of business and industry.

Confidence intervals and hypothesis tests are not disjoint ideas. Indeed, hypothesis testing of a single sample against a benchmark value is (in all schools of thought, I believe) operationally identical with the most common way (Approach 1 below) of constructing a confidence interval and checking whether it includes that benchmark value. But the underlying reasoning is different for confidence limits and hypothesis tests. The logic of confidence intervals is on shakier ground, in my judgment, than that of hypothesis testing, though there are many thoughtful and respected statisticians who argue that the logic of confidence intervals is better grounded and leads less often to error.

Confidence intervals are considered by many to be part of the same topic as estimation - being an estimation of accuracy, in their view. And confidence intervals and hypothesis testing are seen as sub-cases of each other by some people. Whatever the importance of these distinctions among these intellectual tasks in other contexts, they need not concern us here.

Confidence intervals - even if they are meaningful - certainly are controversial. The Encyclopedia of Statistics says: "Confidence intervals are widely used in practice, although not as widely supported by people interested in the foundations of statistics" (Vol. 2, p. 126). Some statisticians will not even discuss the topic. For example, the index to the well-respected book Basic Concepts of Probability and Statistics by Hodges and Lehmann (1970) does not even have a listing for confidence intervals. And Savage in his The Foundations of Statistics (1954/1972) first says that "The doctrine of accuracy estimation is vague" (p. 257) and then later writes the same phrase except with "erroneous" instead of "vague" (p. 260).
He goes on to say that "not being convinced myself, I am in no position to present convincing evidence for the usefulness of interval estimation" (p. 261). He describes Fisher's approach to the matter, fiducial probability, as the "most disputed technical concept of modern statistics" (p. 262), and says no more. He also refers to the supposedly-related idea of tolerance intervals as "slippery," so it is not surprising if the layperson finds the entire matter slippery. [1]

One thing is undeniable, however: Despite the difficulty and subtlety of the topic, the accuracy of estimates must be dealt with, one way or another. Philosophers seldom write about the subject, with the notable exception of Braithwaite (1953), who bases his treatment on Neyman and Pearson; his treatment is adventurous, yet sufficiently obscure to tax anyone's understanding.

Because the logic of confidence intervals is subtle, most statistics texts skim right past the conceptual difficulties and go directly to computation. And when the concept is combined with the conventional algebraic treatment, the composite is truly baffling; the formal mathematics makes impossible any intuitive understanding. For students, "pluginski" is the only viable option.

With the resampling method, however, the mathematics of confidence intervals is easy. The statistical interpretation of the calculations then becomes a challenging and even pleasurable subject; even beginning undergraduates can enjoy the subtlety and find that it feels good to stretch the brain and get down to fundamentals, once the calculations become transparent.

To preview the treatment of confidence intervals presented below, which I hope dissolves the confusion of the topic: We do not learn about the reliability of sample estimates of the mean (and other parameters) by logical inference from any one particular sample to any one particular universe, because this cannot be done in principle. Instead, we investigate the behavior of various universes in the neighborhood of the sample, universes whose characteristics are chosen on the basis of their resemblances to the sample. In this way the estimation of confidence intervals is like all other statistical inference: One investigates the probabilistic behavior of one or more hypothesized universes, the hypotheses being implicitly suggested by the sample evidence but not logically implied by that evidence.

The examples worked through below show why statistics is as difficult a subject as it is. The procedure required to transit successfully from the original question to the statistical probability, and then to the interpretation of that probability, involves a great many choices about the appropriate model based on analysis of the problem at hand; a wrong choice at any point dooms the procedure. The actual computation of the probability - whether done with formulaic probability theory or with resampling - is only a very small part of the procedure, and it is the least difficult part if one proceeds with resampling. The difficulties in the statistical process are not mathematical but rather stem from the hard clear thinking needed to understand the nature of the situation and to ascertain the appropriate way to model it.

In comparison with the logic of hypothesis testing, the logic of confidence limits is more subtle, though it need not be as opaque as one would think from reading the philosophic and statistical literature (e.g. Braithwaite, 1953; Gigerenzer et al., 1989).
The difference, I think, is that in hypothesis-testing situations we find it relatively easy to decide which universes we wish to analyze, and therefore the deductive chain from the statistical question to the probabilistic study is short, clear, and strong. But when inquiring into the accuracy of estimations we find it much harder to decide which universes we wish to analyze. This is the core of the problem, and climbing up the deductive chain is therefore fraught with difficulty.[1]

THE LOGIC OF CONFIDENCE INTERVALS

The purpose of a confidence interval is to help us assess the reliability of one or more statistics of the sample - most often its mean or median - as an estimator of the parameter of the universe. If one draws a sample that is very, very large - large enough so that one need not worry about sample size and dispersion in the case at hand - from a universe whose characteristics one knows, one then can deduce the probability that the sample mean will fall within a given distance of the population mean. Intuitively, it seems as if one should also be able to reverse the process - to infer something about the location of the population mean from the sample mean. But this inverse inference turns out to be a slippery business indeed.

Let's put it differently: It is all very well to say - as one logically may - that on average the sample mean (or other point estimator) equals a population parameter in most situations. But what about the result of any particular sample? How accurate or inaccurate an estimate of the population mean is the sample likely to produce?

The line of thought runs as follows: It is possible to map the distribution of the means (or other such statistic) of samples of any given size (the sample size of interest in any investigation usually being the size of the observed sample) and of any given pattern of dispersion (which we will assume for now can be estimated from the sample) that a universe in the neighborhood of the sample will produce. For example, we can compute how big an interval to the right of a postulated universe's mean will include 45 percent of the samples on one side of the mean, and 45 percent on the other side (a simulation sketch of this computation appears below).

What cannot be done is to draw conclusions from sample evidence about the nature of the universe from which it was drawn, in the absence of some information about the set of universes from which it might have been drawn. That is, one can investigate the behavior of one or more specified universes, and discover the absolute and relative likelihoods that the given specified universe(s) might produce such a sample. But the universe(s) to be so investigated must be specified in advance (which is consistent with the Bayesian view of statistics). To put it differently, we can employ probability theory to learn the pattern(s) of results produced by samples drawn from a particular specified universe, and then compare that pattern to the observed sample. But we cannot infer the probability that that sample was drawn from any given universe in the absence of knowledge of the other possible sources of the sample. That is a subtle difference, but hopefully the following discussion makes it understandable.

COMPUTING CONFIDENCE INTERVALS

In the first part of the discussion we shall leave aside the issue of estimating the extent of the dispersion - a troublesome matter, but one which seldom will result in unsound conclusions even if handled crudely.
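To make concrete the mapping of sample means described above, here is a minimal simulation sketch in Python. The Normal shape of the postulated universe, its mean of 50, its standard deviation of 10, and the sample size of 25 are all hypothetical choices for illustration:

# Mapping the sample means that a postulated universe would produce.
# The Normal shape, mean 50, standard deviation 10, and sample size 25
# are all hypothetical choices.
import random

universe_mean, universe_sd, n = 50.0, 10.0, 25
means = []
for _ in range(10_000):
    sample = [random.gauss(universe_mean, universe_sd) for _ in range(n)]
    means.append(sum(sample) / n)
means.sort()
lo = means[int(0.05 * len(means))]   # 5 percent of sample means lie below here
hi = means[int(0.95 * len(means))]   # 5 percent lie above here
print("inner 90 percent of sample means runs from", lo, "to", hi)

The same mapping could be done by resampling from observed data rather than from a theoretical Normal universe, as in the sketches later in this chapter.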
To start from scratch again: The first - and seemingly straightforward - step is to estimate the mean of the population based on the sample data. The next and more complex step is to ask about the range of values (and their probabilities) that the estimate of the mean might take - that is, the construction of confidence intervals. It seems natural to assume that if our best guess about the population mean is the value of the sample mean, then our best guesses about the various values that the population mean might take - if unbiased sampling error causes discrepancies between population parameters and sample statistics - should be values clustering around the sample mean in a symmetrical fashion (assuming that asymmetry is not forced by the distribution - as, for example, the binomial is close to symmetric near its middle values). But how far away from the sample mean might the population mean be?

Let's walk slowly through the logic, going back to basics to enhance intuition. Let's start with the familiar saying, "The apple doesn't fall far from the tree." Imagine that you are in a very hypothetical place where an apple tree is above you, and you are not allowed to look up at the tree, whose trunk has an infinitely thin diameter. You see an apple on the ground. You must now guess where the trunk (center) of the tree is. The obvious guess for the location of the trunk is right above the apple. But the trunk is not likely to be exactly above the apple; because of sampling dispersion, there is only a small probability of the trunk being at any particular location.

Though you find it easy to make a best guess about where the mean is (the true trunk), with the given information alone you have no way of making an estimate of the probability that the mean is in one place or another, other than that the probability is the same that the tree is to the north or south, east or west, of you. You have no idea about how far the center of the tree is from you. You cannot even put a maximum on the distance it is from you, and without a maximum you could not even reasonably assume a rectangular distribution, or a Normal distribution, or any other.

Next you see two apples. What guesses do you make now? The midpoint between the two obviously is your best guess about the location of the center of the tree. But still there is no way to estimate the probability distribution of the location of the center of the tree.

Now assume you are given still another piece of information: The outermost spread of the tree's branches (the range) equals the distance between the two apples you see. With this information, you could immediately locate the boundaries of the location of the center of the tree. But this is only because the answer you sought was given to you in disguised form.

You could, however, come up with some statements of relative probabilities. In the absence of prior information on where the tree might be, you would offer higher odds that the center (the trunk) is in any unit of area close to the center of your two apples than in a unit of area far from the center. That is, if you are told that either one apple or two apples came from one of two specified trees whose locations are given, with no reason to believe it is one tree or the other (later, we can put other prior probabilities on the two trees), and you are also told the dispersions, you now can put relative probabilities on one tree or the other being the source.
(This is like the Neyman-Pearson procedure, and it is easily reconciled with the Bayesian point of view to be explored later. One can also connect this concept of relative probability to the Fisherian concept of maximum likelihood - which is a likelihood relative to all others.) And you could list from high to low the probabilities for each unit of area in the neighborhood of your apple sample. But this procedure is quite different from making any single absolute numerical probability estimate of the location of the mean.

Now let's say you see 10 apples on the ground. Of course your best estimate is that the trunk of the tree is at their arithmetic center. But how close to the actual tree trunk (the population mean) is your estimate likely to be? This is the question involved in confidence intervals. We want to estimate a range (around the center, which we estimate with the mean of the sample, as we said) within which we are pretty sure that the trunk lies. To simplify, we consider variation along only one dimension - that is, along (say) a north-south line rather than over two dimensions (the entire surface).

We first note that you have no reason to estimate the trunk's location to be outside the sample pattern, or at its edge, though it could be so in principle. If the pattern of the 10 apples is tight, you imagine the pattern of the likely locations of the population mean to be tight; if not, not. That is, it is intuitively clear that there is some connection between how spread out the sample observations are and your confidence about the location of the population mean. For example, consider two patterns of a thousand apples, one with twice the spread of the other, where we measure spread by (say) the diameter of the circle that holds the inner half of the apples for each tree, or by the standard deviation. It makes sense that if the two patterns have the same center point (mean), you would put higher odds on the trunk of the tree with the smaller spread being within some given distance - say, a foot - of the estimated mean. But what odds would you give on that bet?

THE TWO APPROACHES TO ESTIMATING CONFIDENCE INTERVALS

There are two broad conceptual approaches to the question at hand: 1) Study the probability of various distances between the sample mean and the likeliest population mean; and 2) study the behavior of particular border universes. Computationally, both approaches often yield the same result, but their interpretations differ. Approach 1 follows the conventional logic, although it carries out the calculations with resampling simulation.

Approach 1: The Conventional Logic for a Confidence Interval: The Distance Between Sample and Population Mean

If the study of probability can tell us the likelihood that a given population will produce a sample with a mean at a given distance x from the population mean, and if a sample is an unbiased estimator of the population, then it seems natural to turn the matter around and interpret the same sort of data as telling us the probability that the estimate of the population mean is that far from the "actual" population mean. A fly in the ointment is our lack of knowledge of the dispersion, but we can safely put that aside for now. (See below, however.)

This first approach begins by assuming that the universe that actually produced the sample has the same amount of dispersion (but not necessarily the same mean) that one would estimate from the sample.
One then produces (either with resampling or with Normal distribution theory) the distribution of sample means that would occur with repeated sampling from that designated universe, with samples the size of the observed sample. One can then compute the distance between the (assumed) population mean and (say) the inner 45 percent of sample means on each side of it.

The crucial step is to shift vantage points. We look from the sample to the universe, instead of from a hypothesized universe to simulated samples (as we have done so far). The same interval as computed above must be the relevant distance when one looks from the sample to the universe. Putting this algebraically, we can state (on the basis of either simulation or formal calculation) that for any given population S, and for any given distance d from its mean mu,

P[ |mu - xbar| < d ] = alpha,

where xbar is a randomly-generated sample mean and alpha is the probability resulting from the simulation or calculation.

The above equation focuses on the deviation of various sample means (xbar) from a stated population mean (mu). But we are logically entitled to read the algebra in another fashion, focusing on the deviation of mu from a randomly-generated sample mean. This implies that for any given randomly-generated sample mean we observe, the same probability (alpha) describes the probability that mu will be at a distance d or less from the observed xbar. (I believe that this is the logic underlying the conventional view of confidence intervals, but I have yet to find a clear-cut statement of it; in any case, it appears to be logically correct.)

To repeat this difficult idea in slightly different words: If one draws a sample (large enough that one need not worry about sample size and dispersion), one can say in advance that there is a probability p that the sample mean (xbar) will fall within z standard deviations of the population mean (mu). One estimates the population dispersion from the sample. If there is a probability p that xbar is within z standard deviations of mu, then with probability p, mu must be within that same z standard deviations of xbar. To repeat, this is, I believe, the heart of the standard concept of the confidence interval, to the extent that there is a thought-through consensus on the matter.

So we can state for such populations the probability that the distance between the population and sample means will be d or less. Or, with respect to a given distance, we can say that the probability that the population and sample means will be that close together is p. That is, we start by focusing on how much the sample mean diverges from the known population mean. But then - and to repeat once more this key conceptual step - we re-focus our attention to begin with the sample mean, and then discuss the probability that the population mean will be within a given distance. The resulting distance is what we call the "confidence interval."

Please notice that the distribution (universe) assumed at the beginning of this approach did not include the assumption that the distribution is centered on the sample mean or anywhere else. It is true that the sample mean is used for purposes of reporting the location of the estimated universe mean. But despite how the subject is treated in the conventional approach, the estimated population mean is not part of the work of constructing confidence intervals.
Rather, the calculations apply in the same way to all universes in the neighborhood of the sample (which are assumed, for the purpose of the work, to have the same dispersion). And indeed it must be so, because the probability that the universe from which the sample was drawn is centered exactly at the sample mean is very small. This independence of the confidence-interval construction from the mean of the sample (and the mean of the estimated universe) is surprising at first, but after a bit of thought it makes sense.

In this first approach, as noted more generally above, we do not make estimates of the confidence intervals on the basis of any logical inference from any one particular sample to any one particular universe, because this cannot be done in principle; it is the futile search for this connection that for decades roiled the brains of so many statisticians and now continues to trouble the minds of so many students. Instead, we investigate the behavior of (in this first approach) the universe that has a higher probability of producing the observed sample than does any other universe (in the absence of any additional evidence to the contrary), and whose characteristics are chosen on the basis of its resemblance to the sample. In this way the estimation of confidence intervals is like all other statistical inference: One investigates the probabilistic behavior of one or more hypothesized universes, the universe(s) being implicitly suggested by the sample evidence but not logically implied by that evidence. And there are no grounds for dispute about exactly what is being done - only about how to interpret the results.

One difficulty with the above approach is that the estimate of the population dispersion does not rest on sound foundations; this matter will be discussed later, but it is not likely to lead to a seriously misleading conclusion.

A second difficulty with this approach is in interpreting the result. What is the justification for focusing our attention on a universe centered on the sample mean? While this particular universe may be more likely than any other, it undoubtedly has a low probability. And indeed, the statement of the confidence intervals refers to the probabilities that the sample has come from universes other than the universe centered at the sample mean, and quite a distance from it.

My answer to this question does not rest on a set of meaningful mathematical axioms, and I assert that a meaningful axiomatic answer is impossible in principle. Rather, I reason that we should consider the behavior of this universe because other universes near it will produce much the same results, differing only in dispersion from this one, and this difference is not likely to be crucial; this last assumption is all-important, of course. True, we do not know what the dispersion might be for the "true" universe. But elsewhere (Chapter 00 in [Statphil]) I argue that the concept of the "true universe" is not helpful - or maybe even worse than nothing - and should be forsworn. And we can postulate a dispersion for any other universe we choose to investigate. That is, for this postulation we unabashedly bring in any other knowledge we may have. The defense for such an almost-arbitrary move would be that this is a second-order matter relative to the location of the estimated universe mean, and therefore it is not likely to lead to serious error.
(This sort of approximative guessing sticks in the throats of many trained mathematicians, of course, who want to feel an unbroken logic leading backwards into the mists of axiom formation. But the axioms themselves inevitably are chosen arbitrarily, just as there is arbitrariness in the practice at hand, though the choice process for axioms is less obvious and more hallowed by having been done by the masterminds of the past. (See Chapter 00 in [Statphil] on the necessity for judgment.) The absence of a sequence of equations leading from some first principles to the procedure described in the paragraph above is evidence of what is felt to be missing by those who crave logical justification. The key equation in this approach is formally unassailable, but it seems to come from nowhere.)

In the examples in the following chapter may be found computations, for two population distributions - one binomial and one quantitative - of the histograms of the sample means produced with this procedure.

Operationally, we use the observed sample mean, together with an estimate of the dispersion from the sample, to estimate a mean and dispersion for the population. Then, with reference to the sample mean, we state a combination of a distance (on each side) and a probability pertaining to the population mean. The computational examples will illustrate this procedure.

Once we have obtained a numerical answer, we must decide how to interpret it. There is a natural and almost irresistible tendency to talk about the probability that the mean of the universe lies within the interval, but this has proven confusing and controversial. Interpretation in terms of a repeated process is not very satisfying intuitively [1]. In my view, it is not worth arguing about any "true" interpretation of these computations. One could sensibly interpret the computations in terms of the odds a decision-maker, given the evidence, would reasonably offer about the relative likelihoods that the sample came from one of two specified universes (one of them probably being centered on the sample); this does provide some information on reliability, but this procedure departs from the concept of confidence intervals.

The reader may find it useful to read in the next chapter examples of the actual practice of computing confidence intervals in Approach 1, before proceeding to read about Approach 2.
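In the meantime, a minimal resampling sketch may help fix the procedure of Approach 1 in mind. It is written in Python, and the ten observations are invented purely for illustration: the sample itself serves as the estimate of the universe's dispersion, the deviations of resampled means are mapped, and then the vantage point is shifted to read an interval around the observed mean.

# A bootstrap rendering of Approach 1; the ten observations are invented.
import random

sample = [52, 49, 55, 60, 47, 51, 58, 50, 53, 56]
n = len(sample)
obs_mean = sum(sample) / n

# Use the sample itself as the estimate of the universe's dispersion,
# and map the deviations of resampled means from the observed mean.
deviations = []
for _ in range(10_000):
    resample = [random.choice(sample) for _ in range(n)]
    deviations.append(sum(resample) / n - obs_mean)
deviations.sort()

# The inner 90 percent of the deviations estimates the distance d for
# which P[|mu - xbar| < d] is about .90; shifting vantage points, we
# read the interval around the observed sample mean.
lo = obs_mean + deviations[int(0.05 * len(deviations))]
hi = obs_mean + deviations[int(0.95 * len(deviations))]
print("observed mean", obs_mean, "- interval", lo, "to", hi)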
Approach 2: A Relevant Method Though Not a Confidence Interval: Likelihood of Various Universes Producing This Sample

There is another simple method for getting an impression of the location of the sample with respect to the universe that generated it; it is not the same as a confidence interval[1], but it can be illuminating. We can simply pick any particular location and state the probability that a given universe located at that point would produce a sample with a mean as far away as, or farther away than, the observed sample's mean. This method does not require any assumptions about the locations of universes. But it clearly does not allow one to state a probability that the sample came from any particular universe, or set of universes, within any particular interval.

The second approach to the general question of estimate accuracy, then, is to analyze the behavior of a variety of universes centered at other points on the line, rather than the universe centered on the sample mean. One can ask the probability that a distribution centered away from the sample mean, with a given dispersion, would produce (say) a 10-apple scatter having a mean as far away from the given point as the observed sample mean is. If we assume the situation to be symmetric [2], we can find a point at which we can say that a distribution centered there would have only a (say) 5 percent chance of producing the observed sample. And we can also say that a distribution even farther away from the sample mean would have an even lower probability of producing the given sample. But we cannot turn the matter around and say that there is any particular chance that the distribution that actually produced the observed sample is between that point and the center of the sample.

Imagine a situation where you are standing on one side of a canyon, and you are hit by a baseball - the only ball in the vicinity that day. Based on experiments, you can estimate that a baseball thrower whom you see standing on the other side of the canyon has only a 5 percent chance of hitting you with a single throw [1]. But this does not imply that the source of the ball that hit you was someone else standing in the middle of the canyon, because that is patently impossible. That is, your knowledge about the behavior of the "boundary" universe does not logically imply anything about the existence and behavior of any other universes. But just as in the discussion of testing hypotheses, if you know that one possibility is unlikely, it is reasonable that as a result you will draw conclusions about other possibilities in the context of your general knowledge and judgment.

We can find the "boundary" distribution(s) we seek if we a) specify a measure of dispersion, and b) try every point along the line leading away from the sample mean, until we find the distribution that produces samples such as the one observed with a (say) 5 percent probability or less.

To estimate the dispersion, in many cases we can safely use an estimate based on the sample dispersion, using either resampling or Normal distribution theory. The hardest cases for resampling are a) a proportion near 0 or 1.0, and b) a very small sample of data. In such situations one should use additional outside information, or Normal distribution theory, or both.

We can also create a confidence interval in the following fashion: We can first estimate the dispersion for a universe in the general neighborhood of the sample mean, using various devices to be "conservative," if we like.[1] Given the estimated dispersion, we then estimate the probability distribution of various amounts of error between observed sample means and the population mean. We can do this with resampling simulation as follows: a) Create other universes at various distances from the sample mean, but with other characteristics similar to the universe that we postulate for the immediate neighborhood of the sample, and b) experiment with those universes. One can also apply the same logic with a more conventional parametric approach, using general knowledge of the sampling distribution of the mean, based on Normal distribution theory or previous experience with resampling. We shall not discuss the latter method here.

As with Approach 1, we do not make any probability statements about where the population mean may be found. Rather, we discuss only what various hypothetical universes might produce, and make inferences about the "actual" population's characteristics by comparison with those hypothesized universes.
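Here is one way the search for a boundary universe might be sketched (in Python; the data, the 5 percent criterion, and the step size of the search are all hypothetical choices). Each trial universe is simply the observed sample shifted to a candidate center, so that every candidate shares the sample's dispersion:

# A sketch of the search for a "boundary" universe; the data, the
# 5 percent criterion, and the step size are hypothetical choices.
import random

sample = [52, 49, 55, 60, 47, 51, 58, 50, 53, 56]
n = len(sample)
obs_mean = sum(sample) / n

def chance_of_observed(center, trials=5_000):
    # The chance that a universe centered at `center` - the sample
    # shifted there, so it keeps the sample's dispersion - produces a
    # sample mean as far back toward (and beyond) the observed mean.
    shifted = [x - obs_mean + center for x in sample]
    hits = 0
    for _ in range(trials):
        m = sum(random.choice(shifted) for _ in range(n)) / n
        if m <= obs_mean:   # for centers above the observed mean
            hits += 1
    return hits / trials

# Try points farther and farther above the sample mean until samples
# such as the one observed occur only 5 percent of the time.
center = obs_mean
while chance_of_observed(center) > 0.05:
    center += 0.1
print("upper boundary universe centered near", center)

The same walk in the other direction locates the lower boundary universe.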
If we are interested in (say) a 95 percent confidence interval, we want to find the distribution on each side of the sample mean that would produce a sample with a mean that far away only 2.5 percent of the time (2 * .025 = 1 - .95). A shortcut to find these "border distributions" is to plot the sampling distribution of the mean at the center of the sample, as in Approach 1, and then find the (say) 2.5 percent cut-offs at each end of that distribution. On the assumption of equal dispersion at the two points along the line, we now reproduce the previously-plotted distribution with its centroid (mean) at those 2.5 percent points on the line. The new distributions will have 2.5 percent of their areas on the other side of the mean of the sample. (A simulation check of this shortcut appears at the end of this section.)

So, from the standpoint of Approach 2, the conventional sample formula (e.g. Wonnacott and Wonnacott, 1990, p. 5), which is centered at the mean, can be considered a shortcut to estimating the boundary distributions. We say that the boundary is at the point that centers a distribution which has only a (say) 2.5 percent chance of producing the observed sample; it is that distribution which is the subject of the discussion - that is, one of the distributions at the endpoints of the vertical line in Figure II-3-1 - and not the distribution which is centered at mu = xbar. [1]

Figure II-3-1 (Wonnacott and Wonnacott, fig. 8-4, from Clopper and Pearson)

To restate, then: Moving progressively farther away from the sample mean, we can eventually find a universe that has only some (any) specified small probability of producing a sample like the one observed. One can then say that this point represents a "limit" or "boundary," so that the interval between it and the sample mean may be called a confidence interval.

Interpretation of Approach 2

Now to interpret the results of the second approach: Assuming that the sample is not drawn in a biased fashion (such as the wind blowing all the apples in the same direction), and assuming that the population has the same dispersion as the sample, we can say that distributions centered at the 95 percent confidence points (each of them including a tail with 2.5 percent of the area), or even farther away from the sample mean, will produce the observed sample only 5 percent of the time or less.

The result of the second approach is more in the spirit of a hypothesis test than of the usual interpretation of confidence intervals. Another statement of the result of the second approach is: We postulate a given universe - say, a universe at the (say) two-tailed 95 percent boundary line. We then say: The probability that the observed sample would be produced by a universe with a mean as far (or farther) from the observed sample's mean as the universe under investigation is only 2.5 percent. This is similar to the prob-value interpretation of a hypothesis-test framework. It is not a direct statement about the location of the mean of the universe from which the sample has been drawn.

But it is certainly reasonable to derive a betting-odds interpretation of the statement just above, to wit: The chances are 2 1/2 in 100 (or, the odds are 2 1/2 to 97 1/2) that a population located here would generate a sample with a mean as far away as the observed sample's. And it would seem legitimate to proceed to the further betting-odds statement that (assuming we have no additional information) the odds are 97 1/2 to 2 1/2 that the mean of the universe that generated this sample is no farther away from the sample mean than the mean of the boundary universe under discussion. About this statement there is nothing slippery, and its meaning should not be controversial.
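The geometric shortcut described above is easy to check by simulation. A minimal sketch (in Python, with the same invented data as in the earlier sketches): plot the sampling distribution of the mean centered at the sample, and read off its 2.5 percent cut-offs, which under the equal-dispersion assumption are the centers of the two boundary universes:

# Checking the shortcut: the 2.5 percent cut-offs of the sampling
# distribution plotted at the sample mean are, under the equal-
# dispersion assumption, the centers of the two boundary universes.
# Same invented data as in the earlier sketches.
import random

sample = [52, 49, 55, 60, 47, 51, 58, 50, 53, 56]
n = len(sample)
means = sorted(
    sum(random.choice(sample) for _ in range(n)) / n
    for _ in range(10_000)
)
print("boundary centers near", means[int(0.025 * len(means))],
      "and", means[int(0.975 * len(means))])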
Here again, the tactic for interpreting the statistical procedure is to restate the facts of the behavior of the universe that we are manipulating and examining at that moment. We use a heuristic device to find a particular distribution - the one that is at the (say) 97 1/2 - 2 1/2 percent boundary - and simply state explicitly what the distribution tells us implicitly: The probability of this distribution generating the observed sample (or a sample even further removed) is 2 1/2 percent. We could go on to say (if it were of interest to us at the moment) that because the probability of this universe generating the observed sample is as low as it is, we "reject" the "hypothesis" that the sample came from a universe this far away or farther. Or, in other words, we could say that because we would be very surprised if the sample were to have come from this universe, we instead believe that another hypothesis is true. The "other" hypothesis often is that the universe that generated the sample has a mean located at the sample mean or closer to it than the boundary universe.

The behavior of the universe at the 97 1/2 - 2 1/2 percent boundary line can also be interpreted in terms of our "confidence" about the location of the mean of the universe that generated the observed sample. We can say: At this boundary point lies the end of the region within which we would bet 97 1/2 to 2 1/2 that the mean of the universe that generated this sample lies to the (say) right of it.

As noted in the preview to this chapter, we do not learn about the reliability of sample estimates of the population mean (and other parameters) by logical inference from any one particular sample to any one particular universe, because in principle this cannot be done. Instead, in this second approach we investigate the behavior of various universes at the borderline of the neighborhood of the sample, the characteristics of those universes being chosen on the basis of their resemblances to the sample. We seek, for example, to find the universes that would produce samples with the mean of the observed sample less than (say) 5 percent of the time. In this way the estimation of confidence intervals is like all other statistical inference: One investigates the probabilistic behavior of hypothesized universes, the hypotheses being implicitly suggested by the sample evidence but not logically implied by that evidence.

Approaches 1 and 2 may (if one chooses) be seen as identical conceptually as well as (in many cases) computationally. But as I see it, the interpretation of them is rather different, and distinguishing them helps one's intuitive understanding.

Approach 3: A Simulation Method

Here is another new method: We can simulate the behavior of a variety of universes at different distances from us. As one thinks about the concept of a confidence interval, it turns out to be either very hard or impossible to get a clear idea of what others are talking about, or of the meaning of the mathematical operations they perform in connection with that concept - as shown in the quotes from various skeptical statisticians above. To clarify the matter, and also as a practical expedient, I propose a way of defining confidence intervals - or a concept that grasps some of that idea, which we might call an accuracy interval - by means of a device that has clarified many other difficult concepts (e.g. relativity) but that, so far as I can tell, has not been employed with confidence intervals: operational definition.
To use the physical example of estimating the accuracy of the estimate of the location of the trunk of an apple tree to illustrate the logic: We may base our estimate of the spread of the fall of apples from apple trees on the actual sample that we have, and then examine how often a sample of, say, ten apples from such a tree would have a mean as far to the right as we are standing. And this is indeed how we can proceed - trying out simulated trees at differing distances.

These are the operational steps I suggest one would perform to compute a confidence-like accuracy interval in a particular simple case:

1. Mark off the narrowest area of the observed sample distribution that contains 10 percent of the probability density; in the case of a symmetrical distribution, this would be on both sides of the mean. Let us call this area the "target zone."

2. Locate points of width similar to the target zone on the horizontal axis, extending to the left and right of the target zone without bound (assuming that the distribution is two-dimensional). At each of these points, including the middle point of the target zone, locate a bootstrap universe constructed on the model of the observed sample.

3. By simulation (or some analytic device), produce from the middle (target zone) universe (say) 100 means of samples of size n (n being the observed sample size).

4. Mark those means that fall within the target zone and those that fall outside it.

5. Repeat steps 3 and 4 for the first such universes to the left and right of the sample, and then for other universes to the left and right, until they are so far away that they put no noticeable number of means in the target zone.

6. Count the total means in the target zone; ignore all others.

7. Array all these means according to the universes from which they came.

8. Start at the middle point, and continue outward until the universes between the center and that point account for (say) 95 percent of the means within the target zone. Mark that point.

9. The marked point constitutes the boundary of the interval containing 95 percent of the universes that might have given rise to the observed sample.

One can then say that there is a 95 percent probability that the observed sample came from a universe with a mean within that interval. It is all-important for this procedure, of course, that the distribution of universes is assumed to be horizontal - that is, that the prior over the universes' locations is uniform. But we have not had to make any assumptions about the shapes of the universe(s), not even that they (it) be symmetrical.

This third approach has not yet been developed in practice. But the very exercise of thinking it through illuminates the issues involved in constructing conventional confidence intervals, or the boundary intervals described in Approach 2.

CONFIDENCE INTERVALS AND BAYESIAN ANALYSIS

Bayesian thinking can often be valuable in constructing confidence intervals. If one states one's prior beliefs about the distribution of the parameter in question, and then combines that distribution with the observed data, there is nothing mysterious or ambiguous about stating the posterior distribution of belief, which can then be considered as the stuff of a confidence interval. Therefore, Bayesian analysis can serve well to shine clear sunlight on this murky concept.

Indeed, Approach 2 is in the Bayesian spirit in that it asks about probabilities of the observed data conditional upon one or more particular universes. This has the virtue of being quite unambiguous in interpretation. But one can go even further in this direction, as follows: Mark off a set of universes on each side of the sample mean, with centroids at equal distances from each other. Then perform for each of them the same operations that are specified for the universes mentioned in Approaches 1 and 2, and normalize the results. If the prior distribution is assumed to be uniform, the results will be the same as the standard confidence interval. But the interpretation will be different. There will be no attempt to make any statement about the unconditional probability of the mean of the universe. Rather, the result will be a statement about the mean of the universe conditional upon a uniform prior distribution and the sample evidence, which is unchallengeable logically. If the prior distribution is not uniform, appropriate adjustment can be made when normalizing, and again the interpretation is not subject to question, though the assumption about the prior may be questioned.

The problem of computation has always been a barrier to this sort of Bayesian interpretation. But if one simulates Bayesian probabilities, the difficulty disappears; a confidence interval may immediately be read from the posterior distribution in standard Bayesian fashion. Here follows an example of such a simulation.

[INSERT FROM STATSWRK3 BAYESNRM]
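The program listing referred to just above is not reproduced here; in its place, here is a minimal sketch of such a simulation in Python. The ten observations, the grid of candidate universes, and the "hit" tolerance are all hypothetical choices, not taken from the original program:

# A Bayesian simulation with a uniform prior over universe locations.
# Data, grid, and tolerance are hypothetical choices.
import random

sample = [52, 49, 55, 60, 47, 51, 58, 50, 53, 56]
n = len(sample)
obs_mean = sum(sample) / n

# Universes at equally spaced centers: a uniform prior over location.
centers = [obs_mean + step / 10.0 for step in range(-80, 81)]
tolerance = 0.5   # a resampled mean this close to the observed mean "hits"
hits = []
for c in centers:
    shifted = [x - obs_mean + c for x in sample]   # bootstrap universe at c
    count = 0
    for _ in range(1_000):
        m = sum(random.choice(shifted) for _ in range(n)) / n
        if abs(m - obs_mean) <= tolerance:
            count += 1
    hits.append(count)

# Normalize the counts into a posterior over the centers.
total = sum(hits)
posterior = [h / total for h in hits]

# Read a central 95 percent interval straight from the posterior.
cum, lo, hi = 0.0, None, None
for c, p in zip(centers, posterior):
    cum += p
    if lo is None and cum >= 0.025:
        lo = c
    if hi is None and cum >= 0.975:
        hi = c
print("95 percent interval for the universe mean:", lo, "to", hi)

With a non-uniform prior, each universe's count would simply be weighted by its prior probability before normalizing, as described above.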
And even if one wishes to state an extremely "uninformative" prior distribution - that is, a state of affairs in which one asserts close to no knowledge at all - the Bayesian procedure is admirably clear and consistent, pulling no rabbits from a hat. An illustration (using data from Box and Tiao) may be found in Chapter 00. One need not even do anything differently from standard confidence-interval calculations to get the benefit of Bayesian analysis. One may simply interpret the results in the Bayesian fashion so as to obtain meaningful statements.

CONCLUSION

It is not possible in principle to derive a probability statement about the location of the mean or any other parameter of a distribution from a set of data alone, without additional assumptions. One can make unambiguous statements about the probability that any specified distribution, at any given distance from the mean of a sample, would produce a sample of the observed size with a mean located as far or farther from the hypothetical universe's mean as is the observed sample's mean. With various Bayesian-type assumptions, one can make probability statements about the location of the mean of the universe that produced the sample. One can make a simulation with a flat ("linear") Bayesian prior distribution (or some other prior) that will allow one to make probability statements about the location of the mean of the universe that produced the sample. Whether one wishes to refer to either of the above two procedures as a "confidence interval" is a matter of choice.

AFTERNOTE: ABOUT THE INFINITE REGRESS PROBLEM

This afternote expands on an earlier footnote about Savage's objection to confidence intervals on the grounds that they constitute an infinite regress. Even the next level of regression in the sequence that Savage mentions cannot be an important difficulty in practice. If one cares to do so, one may estimate the accuracy of the confidence limits by (in the resampling approach) repeating the overall simulation and observing the variation in the confidence bounds. If one does this and looks at the 95 percent bounds around the confidence bounds, they are huge - so large as to be without meaning in the cases of proportions that Peter Bruce and I have looked at. But this is only surprising until one thinks about it; such large variation is inevitable given that the result is something like a .052 probability.

The exploration in the paragraph above leads back to the question of why confidence limits tend to focus on the same 95 percent and 99.7 percent values as found in classical hypothesis testing. Those values were selected long ago for hypothesis testing because they seem to be intuitive measures of the relevant psychological surprise. And for purposes other than measures of surprise - that is, purposes more directly related to decision-making - hypothesis testing now more frequently (and more sensibly, in my view) looks at the prob-value result itself. But this more flexible prob-value concept does not fit comfortably with confidence intervals. When thought through from scratch, perhaps more sensible confidence values would be 50 percent, or 75 percent, rather than 95 percent - which would be closer to the concept traditionally used in physical experiments as a rough plus-or-minus index of reliability and error. The 50 percent bounds on 50 percent confidence limits might then be a meaningful second-order measure.
As to further regressions - any sensible person stops being concerned with a further order of smalls at some point; one could never live through a day without such approximations. To worry about it is to seek impossible perfection.

ENDNOTES

**FOOTNOTES**

[1]: Savage is troubled by the infinite regress in connection with the estimate of dispersion. "Taking the doctrine literally, it evidently leads to endless regression, for an estimate of the accuracy of an estimate should presumably be accompanied by an estimate of its own accuracy, and so on forever" (p. 257). But if we simply define "accuracy" operationally as the calculations in the approaches discussed below, this difficulty disappears. Savage might say that I have just defined away the difficulty. I'd answer: Yes indeed. It is the highest function of operational definitions such as this one to get us around logical traps and enable us to function with usable tools. This issue is discussed further in the Afternote to the chapter.

[1]: Though the logic of confidence intervals is not only subtle but also rests on shakier ground, in my judgment, than that of hypothesis testing, there are thoughtful and respected statisticians - for example, Thomas Wonnacott - who argue that the logic of confidence intervals is better grounded and leads less often to error.

[1]: An example of this sort of interpretation is as follows:

...Although on average X-bar is on target, the specific sample mean X-bar that we happen to observe is almost certain to be a bit high or a bit low. Accordingly, if we want to be reasonably confident that our inference is correct, we cannot claim that mu is precisely equal to the observed X-bar. Instead, we must construct an interval estimate or confidence interval of the form:

mu = X-bar + sampling error

The crucial question is: How wide must this allowance for sampling error be? The answer, of course, will depend on how much X-bar fluctuates...

Constructing 95% confidence intervals is like pitching horseshoes. In each case there is a fixed target, either the population mu or the stake. We are trying to bracket it with some chancy device, either the random interval or the horseshoe...

There are two important ways, however, that confidence intervals differ from pitching horseshoes. First, only one confidence interval is customarily constructed. Second, the target mu is not visible like a horseshoe stake. Thus, whereas the horseshoe player always knows the score (and specifically, whether or not the last toss bracketed the stake), the statistician does not. He continues to "throw in the dark," without knowing whether or not a specific interval estimate has bracketed mu. All he has to go on is the statistical theory that assures him that, in the long run, he will succeed 95% of the time. (Wonnacott and Wonnacott, 1990, p. 258)

Savage refers to this type of interpretation as follows:

...is a sort of fiction; for it will be found that whenever its advocates talk of making assertions that have high probability, whether in connection with testing or estimation, they do not actually make such assertions themselves, but endlessly pass the buck, saying in effect, "This assertion has arisen according to a system that will seldom lead you to make false assertions, if you adopt it. As for myself, I assert nothing but the properties of the system." (1972, pp. 260-261)

Lee writes at greater length: [where else is quote below?]
[T]he statement that a 95% confidence interval for an unknown parameter ran from -2 to +2 sounded as if the parameter lay in that interval with 95% probability, and yet I was warned that all I could say was that if I carried out similar procedures time after time then the unknown parameters would lie in the confidence intervals I constructed 95% of the time. Subsequently, I discovered that the whole theory had been worked out in very considerable detail in such books as Lehmann (1959, 1986). But attempts such as those that Lehmann describes to put everything on a firm foundation raised even more questions. (Lee, 1989, p. vii)

[1]: Efron and Tibshirani (1993, p. 157) suggest an approach that is computationally like Approach 2, but they interpret the computation differently and refer to it as a confidence interval. They also say that the approach applies only to a Normal distribution, whereas I see no reason for such a restriction.

[2]: Peter Bruce has convinced me that a goodly number of distributions would result in asymmetric confidence intervals. This can cause considerable complications for the conventional formulaic calculations, though resampling handles them nicely. The interpretation requires a longer statement than otherwise, however.

[1]: You can consider this one throw as a sample of one, with that throw as the mean observation, if the prior discussion of sample means would otherwise lead you to question this example.

[1]: More about this later; it is, as I said earlier, not of primary importance in estimating the accuracy of the confidence intervals; note, please, that as we talk about the accuracy of statements about accuracy, we are moving down the ladder of sizes of causes of error.

[1]: When working with proportions, the conventional method must obtain these points from prepared ellipses and binomial tables, not from the sort of geometric trick used in the previous paragraphs, and hence showing the distribution centered at xbar = mu is quite misleading.

**ENDNOTES**

<1>: Peter Bruce's help in clarifying the ideas in this chapter by discussing them with me, along with teaching them jointly with me, has been especially great.