CHAPTER II-4 CONFIDENCE INTERVALS II: PROCEDURE WITH EXAMPLES

Here is a checklist for the canonical procedure for confidence intervals. It follows much the same logic as presented for testing hypotheses in an earlier chapter. We shall begin with the binomial example of a political poll, and then present the "continuous" multi-valued example of tree diameters.

The Accuracy of Political Polls

Consider the reliability of a randomly-selected 1988 presidential election poll, showing 840 intended votes for Bush and 660 intended votes for Dukakis out of 1500 (Wonnacott and Wonnacott, 1990, p. 5). Let us work through the logic of this example.

What is the question? Stated technically, what are the 95% confidence limits for the proportion of Bush supporters in the population? (The proportion is the mean of a binomial population or sample, of course.) More broadly, within which bounds could one confidently believe that the population proportion is likely to lie? At this stage of the work, we must already have translated the conceptual question (in this case, a decision-making question from the point of view of the candidates) into a statistical question. (See Chapter II-1 on translating questions into statistical form.)

What is the purpose to be served by answering this question? There is no sharp and clear answer in this case. The goal could be to satisfy public curiosity, or strategy planning for a candidate (though a national proportion is not as helpful for planning strategy as state data would be).

Is this a "probability" or a "probability-statistics" question? The latter; we wish to infer from sample to population rather than the converse.

Given that this is a statistics question: What is the form of the statistics question - confidence limits or hypothesis testing? Confidence limits.

Given that the question is about confidence limits: What is the description of the sample that has been observed?
a) The raw sample data - the observed numbers of interviewees, 840 for Bush and 660 for Dukakis - constitute the best description of the universe. The statistics of the sample are the given proportions - 56 percent for Bush, 44 percent for Dukakis.

Which universe? (Assuming that the observed sample is representative of the universe from which it is drawn, what is your best guess about the properties of the universe about whose parameter you wish to make statements?) The best guess is that the population proportion is the sample proportion - that is, the population contains 56 percent Bush votes, 44 percent Dukakis votes.

Possibilities for Bayesian analysis? Not in this case, unless you believe that the sample was biased somehow.

Which parameter(s) do you wish to make statements about? Mean, median, standard deviation, range, interquartile range, other? We wish to estimate the proportion in favor of Bush (or Dukakis).

Which symbols for the observed entities? Perhaps 56 green and 44 yellow balls, if an urn is used, or "0" and "1" if the computer is used.

Discrete or continuous distribution? In principle, discrete. (All distributions must be discrete in practice.)

What values or ranges of values? 0-1.

Finite or infinite? Infinite - the sample is small relative to the population.

If the universe is what you guess it to be, what variation among which samples do you wish to estimate? A sample the same size as the observed poll.

Here one may continue either with resampling or with the conventional method. Everything done up to now would be the same whether continuing with resampling or with a standard parametric test.

Conventional Calculational Methods

Estimating the Distribution of Differences Between Sample and Population Means With the Normal Distribution

In the conventional approach, one could in principle work from first principles with lists and sample space, but that would surely be too cumbersome.
One could work with binomial proportions, but this problem has too big a sample for tree-drawing and quincunx techniques; even the ordinary textbook table of binomial coefficients is too small for this job, and calculating binomial coefficients is itself a big job. So instead one would use the Normal approximation to the binomial formula.

(Note to the non-statistician: The distribution of means that we manipulate has the Normal shape because of the operation of the Central Limit Theorem. Sums and averages, when the sample is reasonably large, take on this shape even if the underlying distribution is not Normal. This is a truly astonishing property of randomly-drawn samples - the distribution of their means quickly comes to resemble a "Normal" distribution, no matter the shape of the underlying distribution. We then standardize it with the standard deviation or other device so that we can state the probability distribution of the sampling error of the mean for any sample of reasonable size.)

(The exercise of creating the Normal shape empirically is simply a generalization of particular cases such as we will later create here for the poll by resampling simulation. One can also go one step further and use the formula of de Moivre-Laplace-Gauss to describe the empirical distributions, and use it instead of them. Looking ahead now, the difference between resampling and the conventional approach can be said to be that in the conventional approach we simply plot the Gaussian distribution very carefully, and use a formula instead of the empirical histograms, afterwards putting the results in a standardized table so that we can read them quickly without having to re-create the curve each time we use it. More about the nature of the Normal distribution may be found in Chapter 00 [Statphil].)

All the work done above uses the information specified previously - the sample size of 1500, the drawing with replacement, the observed proportion as the criterion.
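The conventional Normal-approximation computation just described can be sketched as follows. (This is a minimal illustrative sketch, not part of the original text; the 1.96 multiplier is the standard two-tailed 95 percent z value.)

```python
import math

# Observed poll: 840 of 1500 respondents intend to vote for Bush.
n = 1500
p_hat = 840 / n                      # sample proportion, 0.56

# Standard error of a sample proportion under the Normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# 95% limits: roughly 1.96 standard errors on each side of the mean.
z = 1.96
lo, hi = p_hat - z * se, p_hat + z * se
print(f"95% confidence limits for the Bush proportion: ({lo:.3f}, {hi:.3f})")
```

The resulting limits, roughly .535 and .585, match the boundary universes discussed later in this chapter.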
Confidence Intervals Empirically - With Resampling

Estimating the Distribution of Differences Between Sample and Population Means By Resampling

What procedure to produce entities? Random selection from urn or computer.

Simple (single step) or complex (multiple "if" drawings)? Simple.

What procedure to produce re-samples? That is, with or without replacement? With replacement.

Number of drawings? The number of observations in the actual sample, and hence the number of drawings in each re-sample: 1500.

What to record as result of each re-sample drawing? Mean, median, or whatever of the re-sample? The proportion is what we seek.

Stating the distribution of results: The distribution of proportions for the trial samples.

Choice of confidence bounds? 95%, two tails (choice made by the textbook that posed the problem).

Computation of probabilities within chosen bounds: Read the probabilistic result from the histogram of results.

Because the theory of confidence intervals is so abstract (even with the resampling method of computation), let us now walk through this resampling demonstration slowly, using the conventional Approach 1 described previously. We first produce a sample, and then see how the process works in reverse to estimate the reliability of the sample, using the Bush-Dukakis poll as an example. The computer program and output may be found in Chapter 00 [Howteach].

Step 1: Draw a sample of 1500 voters from a universe that, based on the observed sample, is 56 percent for Bush, 44 percent for Dukakis. The first such sample produced by the computer happens to be 53 percent for Bush; it might have been 58 percent, or 55 percent, or, very rarely, 49 percent for Bush.

Step 2: Repeat step 1 perhaps 400 or 1000 times.
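Steps 1 and 2 can be sketched by simulation as follows. (A hypothetical sketch using Python with numpy, not the program referred to in the text; the seed and the 1000 trials are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for a repeatable run

n = 1500       # poll size
p = 0.56       # benchmark universe: the observed Bush proportion
trials = 1000

# Steps 1-2: draw many samples of 1500 from the 56-44 universe,
# recording the Bush proportion of each.
proportions = rng.binomial(n, p, size=trials) / n

# The middle 95 percent of these proportions shows how far such
# samples stray from the universe proportion.
lo, hi = np.percentile(proportions, [2.5, 97.5])
print(f"95% of sample proportions fall in ({lo:.3f}, {hi:.3f})")
```

With the seed shown, the middle 95 percent of the resampled proportions comes out near .535-.585, in line with the conventional calculation.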
Step 3: Estimate the distribution of means (proportions) of samples of size 1500 drawn from this 56-44 percent Bush-Dukakis universe; the resampling result is shown in Figure II-4-1.

Figure II-4-1

Step 4: In a fashion similar to what was done in steps 1-3, now compute the 95 percent confidence intervals for some other postulated universe mean - say 53% for Bush, 47% for Dukakis. This step produces a confidence interval that is not centered on the sample mean and the estimated universe mean, and hence it shows the independence of our procedure from that magnitude. And we now compare the estimated confidence bounds - the 5th and 95th percentiles - generated with the 53-47 percent universe against the corresponding distribution of sample means generated by the "true" Bush-Dukakis population of 56 percent - 44 percent. If the procedure works well, the results of the two procedures should be similar.

Now we interpret the results using this first approach. The histogram shows the probability that the difference between the sample mean and the population mean - the error in the sample result - will be (say) 4 percentage points too low. It follows that about 47.5 percent (half of 95 percent) of the time, a sample like this one will be between the population mean and 4 percent too low. We do not know the actual population mean. But for any observed sample like this one, we can say that there is a 47.5 percent chance that the distance between it and the mean of the population that generated it is minus four percent or less.

Now a crucial step: We turn around the statement just above, and say that there is a 47.5 percent chance that the population mean is less than four percentage points higher than the mean of a sample drawn like this one, but at or above the sample mean. (And we do the same for the other side of the sample mean.)

So to recapitulate: We observe a sample and its mean.
We estimate the error by experimenting with one or more universes in that neighborhood, and we then give the probability that the population mean is within that margin of error from the sample mean.

We can also use Approach 2, which computationally is simply a short-circuiting of Approach 1 (though the interpretations differ), as follows:

Step 1: As above.

Step 2: With a hypothetical distribution that is 56 percent for Bush (the sample estimate) - and, in a non-binomial case, with the dispersion estimated from the sample - generate perhaps 400 samples of size 1500.

Step 3: Find the 95th percentile of the samples in Step 2.

Step 4: Centered at that 95th percentile, generate a distribution of samples of size 1500 with the population dispersion assumed the same as in Step 2.

Step 5: Find the boundary which includes 95 percent of the samples. If this boundary is indeed the sample mean, then the point at which this distribution is centered is indeed the 95 percent confidence bound (as it must be as long as the dispersion used in all of the universes is the same; they are just set off from each other algebraically).

Approach 2 for Counted Data: The Bush-Dukakis Poll

Let's implement Approach 2 for counted data, using for comparison the Bush-Dukakis poll data discussed earlier in the context of Approach 1. We seek to state, for universes that we select on the basis that their results will interest us, the probability that they (or it, for a particular universe) would produce a sample as far or farther away from the mean of the universe in question as the mean of the observed sample - 56 percent for Bush. The most interesting universe is that which produces such a sample only about 5 percent of the time, simply because of the correspondence of this value to a conventional break-point in statistical inference. So we could experiment with various universes by trial and error to find this universe.
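The trial-and-error experiment just described might be sketched as follows. (A hypothetical sketch; the list of candidate universes is an illustrative choice, and the seed is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
n, observed, trials = 1500, 0.56, 10_000

def tail_share(p):
    # Share of samples from a universe at proportion p whose Bush
    # proportion comes out at least as high as the observed 0.56.
    samples = rng.binomial(n, p, size=trials) / n
    return (samples >= observed).mean()

# Try universes below the sample mean until we find the one that
# produces so pro-Bush a sample only about 2.5 percent of the time.
candidates = [0.525, 0.530, 0.535, 0.540, 0.545]
shares = {p: tail_share(p) for p in candidates}
boundary = min(candidates, key=lambda p: abs(shares[p] - 0.025))
print("boundary universe:", boundary, "tail share:", shares[boundary])
```

With these settings the search settles on the universe at .535, whose upper tail beyond .56 holds roughly 2.5 percent of its samples.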
We can learn from our previous simulations of the Bush-Dukakis poll in Approach 1 that about 95 percent of the samples fall within .025 on either side of the sample mean (which we had been implicitly assuming is the location of the population mean). If we assume (and there seems no reason not to) that the dispersions of the universes we experiment with are the same, we will find (by symmetry) that the universes we seek are centered on those points .025 away from .56, that is, at .535 and .585.

From the standpoint of Approach 2, then, the conventional sample formula that is centered at the mean can be considered a shortcut to estimating the boundary distributions. We say that the boundary is at the point that centers a distribution which has only a (say) 2.5 percent chance of producing the observed sample; it is that distribution which is the subject of the discussion - that is, one of the distributions at the endpoints of the vertical line in Figure II-3-1 - and not the distribution which is centered at mu = xbar. [1] The results of these simulations are shown in Figure II-4-2.

Figure II-4-2

About these distributions centered at .535 and .585 - or more importantly for understanding an election situation, the universe centered at .535 - one can say: Even if the "true" value is as low as 53.5 percent for Bush, there is only a 2 1/2 percent chance that a sample as high as 56 percent pro-Bush would be observed. (That the 2 1/2 percent probability and the 2 1/2 percentage-point difference between 56 percent and 53.5 percent coincide is arithmetically a matter of chance in this case.) It would be even more revealing in an election situation to make a similar statement about the universe located at 50-50, but this would bring us almost entirely within the intellectual ambit of hypothesis testing.

The demonstrations above using both Approaches 1 and 2 shed light on the logic of interpretation of confidence intervals.
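Before turning to interpretation, the two endpoint universes can be checked numerically. (A hypothetical sketch; the .535 and .585 endpoints are taken from the discussion above, and the seed is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(2)
n, observed, trials = 1500, 0.56, 10_000

# Lower boundary universe: how often does it yield a sample at least
# as pro-Bush as the one observed?
low = rng.binomial(n, 0.535, size=trials) / n
share_low = (low >= observed).mean()

# Upper boundary universe: how often does it yield a sample at least
# as low (anti-Bush) as the one observed?
high = rng.binomial(n, 0.585, size=trials) / n
share_high = (high <= observed).mean()

print("lower-tail share:", share_low, "upper-tail share:", share_high)
```

Each endpoint universe produces a sample as extreme as the observed .56 only about 2.5 percent of the time, as the symmetry argument requires.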
We have no basis in the work so far to say that there is a 95 percent chance that the confidence interval computed from a particular sample captures the universe mean, or to make any other such statement about the universe mean. Even so, unless you have reason to believe that the probabilities of some universe means in the neighborhood of the sample mean are very different from others - and assuming they are not would seem safe in the case of the presidential poll - then it would seem reasonable to offer betting odds of 95 percent that the confidence interval computed from a particular sample captures the universe mean. If so, there would seem nothing objectionable in this "naive" interpretation for a particular sample.

Samples Whose Observations May Have More Than Two Values

So far we have discussed samples and universes that we can characterize as proportions of elements which can have only one of two characteristics - green or red, 1 or 0. Now let us consider observations that can be characterized by a wider variety of numbers; these cases are both simpler and more complex than proportional universes. These are problems with "continuous" (really multi-valued) data instead of the two-value election-poll problem above. The binomial case has a deceptively easy appearance; in many ways the present problem is easier to handle than most. (Incidentally, in contrast to the Bush-Dukakis poll example above, the 1992 U.S. presidential election was not binomial but trinomial, and therefore a much more difficult problem to deal with.)

A collection that contains only two sorts of elements (say, green and red chips) can be characterized by just the proportion (and the total number of elements). But a collection of (say) prices of farms sold in province Z in year t would be characterized by the numbers sold at each of many prices (and the total number of sales).
In the latter case, we notice at least two characteristics: a) some sort of average, and b) the extent to which the elements are spread out (and there may be yet other characteristics that interest us). The inferences that we make about the dispersion of such a collection are another important part of statistical inference, interesting both for the information in itself and for the light it throws on the certainty of our other inferences.

Consider, for instance, that we have just the one sale price of 13Q. We could estimate that the distribution is centered around 13Q, but we have no idea whether all the prices are 13Q, or whether the other prices tend to be far from 13Q. What if there are only two sale prices - 13Q and 15Q - but we have no other information, not even the meaning of a Q unit? What can we reasonably say about the distribution, given that we have been assured that the two observations are a representative sample of prices? We might immediately guess that half of the population is within, and half outside of, 13Q and 15Q. But what shape should we guess for the distribution? Should it be horizontal? Shaped like a Normal curve? Skewed to the right? Here we have no recourse but to use some additional experience and perhaps theory.

If we have some additional observations - say 10 more - we could estimate the dispersion of the population, perhaps calculating a standard deviation. That would give some guidance even without assuming a shape for the distribution. If we had some reason to assume that the distribution is shaped Normally - say, if it arose from observations of a planet, and the scatter could be assumed to be due to "error" - we could immediately do the sort of inference that led to the Normal distribution two centuries ago.
If one of the observations is quite far from the others - an apparent "outlier" - we could calculate the probability of its occurrence if it were part of the same distribution, using the standard deviation or other measure of the distribution's dispersion. This would throw some light on whether it probably was generated by the same universe as were the other observations.

Approach 1 for Measured Data Example: Estimating Tree Diameters

What is the question? A horticulturist is experimenting with a new type of tree. She plants 20 of them on a plot of land, and measures their trunk diameter after two years. She wants to establish a 90% confidence interval for the population average trunk diameter. For the data given below, calculate the mean of the sample and calculate (or describe a simulation procedure for calculating) a 90% confidence interval around the mean. Here are the 20 diameters (in no particular order):

8.5 7.6 9.3 5.5 11.4 6.9 6.5 12.9 8.7 4.8 4.2 8.1 6.5 5.8 6.7 2.4 11.1 7.1 8.8 7.2

What is the purpose to be served by answering the question? Either research and development, or pure science.

Is this a "probability" or a "statistics" question? Statistics.

What is the form of the statistics question? Confidence limits.

What is the description of the sample that has been observed? The raw data as shown above.

Statistics of the sample? Mean of the tree data.

Which universe? Assuming that the observed sample is representative of the universe from which it is drawn, what is your best guess about the properties of the universe whose parameter you wish to make statements about? Answer: That the universe is like the sample above, containing the numbers 8.5...7.2 - that is, the population of trees of this new type, as best estimated by the observations in the sample.

(Are there possibilities for Bayesian analysis?) No Bayesian prior information will be included.

Which parameter do you wish to make statements about? The mean.

Which symbols for the observed entities?
Cards or computer entries with the numbers 8.5...7.2, a sample from a universe of infinite size.

If the universe is as guessed at, the variation among which samples do you wish to estimate? Samples of size 20.

Here one may continue with the conventional method. Everything up to now is the same whether continuing with resampling or with a standard parametric test. The information listed above is the basis for a conventional test. Use perhaps a t test: calculate the standard deviation, and apply the t distribution (show the Normal first). Read the number of degrees of freedom from the sample size above. Show the formula for mu +- 2 s.d.

Continuing with resampling:

What procedure will be used to produce the trial entities? Random selection; simple (single step), not complex (multiple "if") sample drawings.

What procedure to produce re-samples? With replacement.

Number of drawings? 20 trees.

What to record as result of each re-sample drawing? The mean.

How to state the distribution of results? See histogram.

Choice of confidence bounds: 90%, two-tailed.

Computation of probabilities within chosen bounds: Read from histogram.

Approach 2 for Measured Data: The Diameters of Trees

To implement Approach 2 for measured data, one may proceed exactly as with Approach 1 above, except that the output of the simulation with the sample mean as midpoint will be used for guidance about where to locate trial universes for Approach 2. Working from the histogram in Figure II-3-?, we try universes located at 53.8 and 58.2. The results are shown in Figure II-4-3.
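Before looking at those results, the Approach 1 resampling checklist above can be sketched as follows. (A hypothetical sketch using the 20 diameters given earlier, not the program behind the figures; the seed and the 1000 trials are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(3)

diameters = np.array([8.5, 7.6, 9.3, 5.5, 11.4, 6.9, 6.5, 12.9, 8.7, 4.8,
                      4.2, 8.1, 6.5, 5.8, 6.7, 2.4, 11.1, 7.1, 8.8, 7.2])
sample_mean = diameters.mean()

# Draw 1000 re-samples of 20 trees each, with replacement,
# recording the mean of each re-sample.
means = rng.choice(diameters, size=(1000, 20), replace=True).mean(axis=1)

# The 5th and 95th percentiles of those means bound a 90% interval.
lo, hi = np.percentile(means, [5, 95])
print(f"sample mean {sample_mean:.2f}; 90% interval ({lo:.2f}, {hi:.2f})")
```

The sample mean of these 20 diameters is 7.5, and with the seed shown the 90 percent interval comes out at roughly 7.5 plus or minus 0.9.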
Figure II-4-3

Interpretation of Approach 2

Now to interpret the results of the second approach: Assuming that the sample is not drawn in a biased fashion (such as the wind blowing all the apples in the same direction), and assuming that the population has the same dispersion as the sample, we can say that distributions centered at the 95 percent confidence points (each of them including a tail with 2.5 percent of the area), or even further away from the sample mean, will produce the observed sample only 5 percent of the time or less.

The result of the second approach is more in the spirit of a hypothesis test than of the usual interpretation of confidence intervals. Another statement of the result of the second approach is: We postulate a given universe - say, a universe at the (say) two-tailed 95 percent boundary line. We then say: The probability that the observed sample would be produced by a universe with a mean as far (or farther) from the observed sample's mean as the universe under investigation is only 2.5 percent. This is similar to the prob-value interpretation of a hypothesis-test framework. It is not a direct statement about the location of the mean of the universe from which the sample has been drawn. But it is certainly reasonable to derive a betting-odds interpretation of the statement just above, to wit: The chances are 2 1/2 in 100 (or, the odds are 2 1/2 to 97 1/2) that a population located here would generate a sample with a mean as far away as the observed sample. And it would seem legitimate to proceed to the further betting-odds statement that (assuming we have no additional information) the odds are 97 1/2 to 2 1/2 that the mean of the universe that generated this sample is no farther away from the sample mean than the mean of the boundary universe under discussion. About this statement there is nothing slippery, and its meaning should not be controversial.
Here again the tactic for interpreting the statistical procedure is to restate the facts of the behavior of the universe that we are manipulating and examining at that moment. We use a heuristic device to find a particular distribution - the one that is at (say) the 97 1/2 - 2 1/2 percent boundary - and simply state explicitly what the distribution tells us implicitly: The probability of this distribution generating the observed sample (or a sample even further removed) is 2 1/2 percent. We could go on to say (if it were of interest to us at the moment) that because the probability of this universe generating the observed sample is as low as it is, we "reject" the "hypothesis" that the sample came from a universe this far away or further. Or in other words, we could say that because we would be very surprised if the sample were to have come from this universe, we instead believe that another hypothesis is true. The "other" hypothesis often is that the universe that generated the sample has a mean located at the sample mean or closer to it than the boundary universe. The behavior of the universe at the 97 1/2 - 2 1/2 percent boundary line can also be interpreted in terms of our "confidence" about the location of the mean of the universe that generated the observed sample. We can say: At this boundary point lies the end of the region within which we would bet 97 1/2 to 2 1/2 that the mean of the universe that generated this sample lies to the (say) right of it. As noted in the preview to this chapter, we do not learn about the reliability of sample estimates of the population mean (and other parameters) by logical inference from any one particular sample to any one particular universe, because in principle this cannot be done. Instead, in this second approach we investigate the behavior of various universes at the borderline of the neighborhood of the sample, the characteristics of those universes being chosen on the basis of their resemblances to the sample. 
We seek, for example, to find the universes that would produce samples with the mean of the observed sample less than (say) 5 percent of the time. In this way the estimation of confidence intervals is like all other statistical inference: One investigates the probabilistic behavior of hypothesized universes, the hypotheses being implicitly suggested by the sample evidence but not logically implied by that evidence.

Approaches 1 and 2 may (if one chooses) be seen as identical conceptually as well as (in many cases) computationally. But as I see it, the interpretation of them is rather different, and distinguishing them helps one's intuitive understanding.

THE PROBLEM OF UNCERTAINTY ABOUT THE DISPERSION

The inescapable difficulty of estimating the amount of dispersion in the population has greatly exercised statisticians over the years. Hence I must try to clarify the matter. Yet in practice this issue turns out not to be a likely source of much error even if one is somewhat wrong about the extent of dispersion, and therefore we should not let it be a stumbling block in the way of our producing estimates of the accuracy of samples in estimating population parameters.

Student's t test was designed to get around the problem of the lack of knowledge of the population dispersion. But Wallis and Roberts wrote about the t test: "[F]ar-reaching as have been the consequences of the t distribution for technical statistics, in elementary applications it does not differ enough from the normal distribution... to justify giving beginners this added complexity" (1956, p. x). "Although Student's t and the F ratio are explained... the student ... is advised not ordinarily to use them himself but to use the shortcut methods... These, being non-parametric and involving simpler computations, are more nearly foolproof in the hands of the beginner - and, ordinarily, only a little less powerful" (p.
xi).<1> If we knew the population parameter - the proportion, in the case we will discuss - we could easily determine how inaccurate the sample proportion is likely to be. If, for example, we wanted to know about the likely inaccuracy of the proportion of a sample of 100 voters drawn from a population of a million that is 60% Democratic, we could simply simulate drawing (say) 200 samples of 100 voters from such a universe, and examine the average inaccuracy of the 200 sample proportions. But in fact we do not know the characteristics of the actual universe. Rather, the nature of the actual universe is what we seek to learn about. Of course, if the amount of variation among samples were the same no matter what the Republican-Democrat proportions in the universe, the issue would still be simple, because we could then estimate the average inaccuracy of the sample proportion for any universe and then assume that it would hold for our universe. But it is reasonable to suppose that the amount of variation among samples will be different for different Democrat-Republican proportions in the universe. Let us first see why the amount of variation among samples drawn from a given universe is different with different relative proportions of the events in the universe. Consider a universe of 999,999 Democrats and one Republican. Most samples of 100 taken from this universe will contain 100 Democrats. A few (and only a very very few) samples will contain 99 Democrats and one Republican. So the biggest possible difference between the sample proportion and the population proportion (99.9999%) is less than one percent (for the very few samples of 99% Democrats). And most of the time the difference will only be the tiny difference between a sample of 100 Democrats (sample proportion = 100%), and the population proportion of 99.9999%. Compare the above to the possible difference between a sample of 100 from a universe of half a million Republicans and half a million Democrats. 
At worst a sample could be off by as much as 50% (if it got zero Republicans or zero Democrats), and at best it is unlikely to get exactly 50 of each. So it will almost always be off by 1% or more.

It seems, therefore, intuitively reasonable (and in fact it is true) that the likely difference between a sample proportion and the population proportion is greatest with a 50%-50% universe, least with a 0%-100% universe, and somewhere in between for probabilities between 50% and the endpoints, in the fashion of Figure II-4-4.

Figure II-4-4

Though one commonly estimates the variation of sample means (sample sizes the same as the observed sample) for proportions in the neighborhood of the estimated population mean - which implies a population dispersion (s.d.) appropriate for that neighborhood - one could also use a more "conservative" estimate of dispersion; Mosteller et al. (1970) suggest that if you work with the largest possible amount of variation (for example, the value at .5 in the case of a problem involving a proportion), you ensure that you cannot obtain too small a confidence interval by underestimating the variation. (Here again we see the role of judgment, as discussed in Chapter 00.)

Perhaps it will help to clarify the issue of estimating dispersion if we consider this: between an estimate of the dispersion for a second sample based on a) the population, or on b) the first sample, the former will be more accurate than the latter, because of the sampling variation in the first sample that affects the latter estimate. But we cannot estimate that sampling variation without knowing more about the population.

ARGUMENTS ABOUT INTERPRETATION OF CONFIDENCE INTERVALS

Discussions of confidence intervals often assert that one cannot make a probability statement about where the population mean may be, but one can make statements about the probability that a set of samples may bound it. For example: ...
Although on average X-bar is on target, the specific sample mean X-bar that we happen to observe is almost certain to be a bit high or a bit low. Accordingly, if we want to be reasonably confident that our inference is correct, we cannot claim that mu is precisely equal to the observed X-bar. Instead, we must construct an interval estimate or confidence interval of the form:

mu = X-bar + sampling error

The crucial question is: How wide must this allowance for sampling error be? The answer, of course, will depend on how much X-bar fluctuates...

Constructing 95% confidence intervals is like pitching horseshoes. In each case there is a fixed target, either the population mu or the stake. We are trying to bracket it with some chancy device, either the random interval or the horseshoe. This analogy is illustrated in Figure 8-3. There are two important ways, however, that confidence intervals differ from pitching horseshoes. First, only one confidence interval is customarily constructed. Second, the target mu is not visible like a horseshoe stake. Thus, whereas the horseshoe player always knows the score (and specifically, whether or not the last toss bracketed the stake), the statistician does not. He continues to "throw in the dark," without knowing whether or not a specific interval estimate has bracketed mu. All he has to go on is the statistical theory that assures him that, in the long run, he will succeed 95% of the time. (Wonnacott and Wonnacott, 1990, p. 258)

This criticism does not seem to me to fit Approach 1 above. The criticism apparently stems from objections by the frequentists. But if one takes the operational-definition point of view (see Chapter 00), and if we agree that our interest is upcoming events and probably decision-making, then we obviously are interested in putting betting odds on the location of the population mean (and subsequent samples). A statement about process will not help us with that; only a probability statement will.
Notice that in the earlier discussion it was never necessary to use the notion of the "true" population mean that such writers as Wonnacott and Wonnacott employ (see their appendix). As discussed in Chapter 00, the notion of a "true parameter" tends to confuse the issue, and is out of keeping with Einstein's device of the operational definition. Rather than having in mind some "true" value, we should instead ask: "What will happen if I...", or "...if I again..."

Bayesians, too, complain of the process point of view. Savage writes that the process

...is a sort of fiction; for it will be found that whenever its advocates talk of making assertions that have high probability, whether in connection with testing or estimation, they do not actually make such assertions themselves, but endlessly pass the buck, saying in effect, "This assertion has arisen according to a system that will seldom lead you to make false assertions, if you adopt it. As for myself, I assert nothing but the properties of the system." (1972, pp. 260-261)

Lee writes at greater length:

[T]he statement that a 95% confidence interval for an unknown parameter ran from -2 to +2 sounded as if the parameter lay in that interval with 95% probability and yet I was warned that all I could say was that if I carried out similar procedures time after time then the unknown parameters would lie in the confidence intervals I constructed 95% of the time. Subsequently, I discovered that the whole theory had been worked out in very considerable detail in such books as Lehmann (1959, 1986). But attempts such as those that Lehmann describes to put everything on a firm foundation raised even more questions. (Lee, 1989, p. vii)

NOTES ON THE USE OF CONFIDENCE INTERVALS

1. Confidence intervals are used more frequently in the physical sciences - indeed, the concept was developed for use in astronomy - than in biostatistics and the social sciences; in these latter fields measurement is less often the main problem, and the distinction between hypotheses often is difficult to draw.

2. Some statisticians suggest that one can do hypothesis tests with the confidence-interval concept. But that seems to me equivalent to suggesting that one can get from New York to Chicago by flying first to Los Angeles. Additionally, the logic of hypothesis tests is much clearer than the logic of confidence intervals, and it corresponds much more easily to our intuitions.

3. Discussions of confidence intervals sometimes assert that one cannot make a probability statement about where the population mean may be, yet can make statements about the probability that a particular set of samples may bound that mean. If one takes the operational-definition point of view (see the discussion of that concept in connection with the concept of probability), and we agree that our interest is in upcoming events and probably in decision-making, then we obviously are interested in putting betting odds on the location of the population mean (and of subsequent samples). A statement about process will not help us with that; only a probability statement will.

Moving progressively farther away from the sample mean, we can find a universe that has only some (any) specified small probability of producing a sample like the one observed. One can say that this point represents a "limit" or "boundary," and that the region between it and the sample mean may be called a confidence interval, I suppose.

SUMMARY

Let's summarize what one can and cannot assert about confidence intervals:

1. One can always state the probability that a given population S will produce a given sample s (or, more precisely, a sample with a given mean xbar or other statistic).
This is a straightforward deduction which can be performed either theoretically, with formal probability theory, or with a Monte Carlo resampling technique. Indeed, such statements are the core of all statistics problems; all the rest of statistics is interpretation.

2. Derived from (1), one can state the relative probabilities of two given S's producing a given s, and the ratio of those probabilities.

3. One cannot ever estimate the probability that a particular sample came from any particular population - or even put probabilistic bounds (confidence limits) around its mean - on the basis of the sample evidence alone. This is the problem of induction that mathematicians and philosophers have been struggling with for more than two centuries, and undoubtedly before that, too. Even if one knows that a given population would produce the observed sample (or a sample even further away) only (say) 5 percent of the time, one cannot say anything about the probability that that particular population produced the observed sample on the basis of the sample evidence alone. The probability of any given population depends on the probabilities of the other candidate populations.

To see that this is so, postulate that we have been told that a given sample of green and red balls was produced by one of two universes - urn A with a proportion X of green balls, and urn B with a proportion Y of green balls - and that beforehand it was equally likely that the sample was drawn from either urn. Assume we are then able to state (using Bayesian reasoning) that it is twice as likely that the sample came from urn A as from urn B. Now notice: if the alternative urn B had some proportion other than Y, our conclusion about urn A would differ from "twice as likely," even though urn A and the sample are unchanged. This demonstrates that without some assumption about the alternatives to a stated population, no meaningful statement can be made about the probability that a sample came from that population.
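The urn argument can be put in numbers. The sketch below is a hypothetical illustration of my own (the sample of 7 green balls in 10 draws and the proportions 0.7, 0.5, and 0.6 are invented): with equal prior odds on the two urns, Bayes' rule makes the posterior odds on urn A equal to the likelihood ratio, and changing only the assumed alternative urn B changes the odds on urn A:

```python
from math import comb

def posterior_odds_A_vs_B(green, n, pA, pB):
    """With equal prior odds on urns A and B, the posterior odds that a
    sample of n balls containing `green` green ones came from A rather
    than B equal the likelihood ratio (the binomial coefficients cancel)."""
    like_A = comb(n, green) * pA**green * (1 - pA) ** (n - green)
    like_B = comb(n, green) * pB**green * (1 - pB) ** (n - green)
    return like_A / like_B

# 7 green balls in 10 draws; urn A is 70% green, urn B is 50% green:
print(posterior_odds_A_vs_B(7, 10, 0.7, 0.5))  # roughly 2.3 to 1 in favor of A
# Change only the alternative urn B to 60% green; the odds on A shrink:
print(posterior_odds_A_vs_B(7, 10, 0.7, 0.6))  # roughly 1.2 to 1
```

The point is exactly the one in the text: the verdict on urn A (here about 2.3 to 1) depends on what the alternative population is assumed to be, not on the sample evidence alone.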
Here again I repeat the crucial distinction between discussing the probability that a sample could come from a given universe, and the probability that a sample came from a given universe. The former is straightforward, as in (1) above; the latter cannot be stated meaningfully without additional assumptions. Not distinguishing between these two statements may be at the heart of most muddles about the fundamentals of statistics.

With the first approach described in this chapter, we can sensibly say something about the probability that the mean of the population that produced a particular sample is within some distance of the sample mean, or that a particular population has only an X percent chance of producing a sample like this one. Those statements are entirely different from speaking about the probability that the sample came from a given population.

With the second approach described in this chapter, one can say that the confidence interval includes all the means of populations that have a greater than 5 percent chance of producing the observed sample. This crucial statement may be cumbersome, but it is logically airtight. On the other hand, it does not imply - so far as I can now see - anything about the mean of the population from which this sample actually came, or more precisely, the population that produced this sample.

The oft-denounced statement that the confidence interval includes the population mean, or that the population mean lies within those bounds, with probability of (say) 95 percent is loose but not too bad if we include implicit assumptions about non-bias and about the dispersion of the population and the sample. Or, as some would prefer: this procedure will lead to those points bracketing the population mean 95 percent of the time you do this sort of thing. Such statements probably are not very inaccurate, given that the world around us is well-behaved in such respects most of the time (see Chapter I-1).
And such statements should be generally acceptable. But they are not logically implied. Nor can any of this be proven empirically in any way, so far as I know. (It might be tested on assumptions of equality of dispersion along the continuum, assuming a continuum of some sort. But this may not be a profitable avenue of thought.)

FOOTNOTES

[1]: When working with proportions, the conventional method must obtain these points from prepared ellipses and binomial tables, not from the sort of geometric trick used in the previous paragraphs. Hence showing the distribution centered at xbar = mu, as in the conventional approach, is quite misleading.

There seems to me to be no basis for this, either. After all, a single sample may be regarded as n samples of size one. Why should one be able to draw different sorts of conclusions from a set of samples of size one than from the evidence of all those samples aggregated into a single large sample? The principle is the same.

ENDNOTES

<1>: They go on to say, "Techniques and details, beyond a comparatively small range of fairly basic methods, are likely to do more harm than good in the hands of beginners...The great ideas...are lost...nonparametric [methods] involving simpler computations, are more nearly foolproof in the hands of the beginner" (1956, viii, xi). Their stance is very much in contrast to that of Fisher, who wrote somewhere of the t test as a "revolution."