CHAPTER II-1

TRANSLATING SCIENTIFIC QUESTIONS INTO PROBABILISTIC AND STATISTICAL QUESTIONS

The first step in using probability and statistics is to translate the scientific question into a statistical question. Once you know exactly which prob-stats question you want to ask -- that is, exactly which probability you want to determine -- the rest of the work is relatively easy. The stage at which you are most likely to make mistakes is in stating the question you want to answer in probabilistic terms.

The crucial process of translating from a pre-statistical question to a statistical question takes place in all statistical inference. But its nature comes out most sharply with respect to testing hypotheses, so most of what will be said about it will be in that context.

This chapter may seem elementary to the professional statistician, and if so, it may well be skipped.

THE THREE TYPES OF QUESTIONS

The Scientific Question

A study for either scientific or decision-making purposes properly begins with a general question about the nature of the world -- that is, a conceptual or theoretical question. One must then transform this question into an operational-empirical form that one can study scientifically. Thence comes the translation into a technical-statistical question.

The scientific-conceptual-theoretical question can be an issue of theory, or a policy choice, or the result of curiosity at large. Examples include: Can the bioengineer increase the chance of female calves being born? Has the scarcity of copper been going down? Are the prices of liquor systematically different in states where the liquor stores are publicly owned compared to states where they are privately owned? Does a new formulation of pig rations lead to faster hog growth? Was the rate of unemployment higher last month than the long-run average, or was the higher figure likely to be the result of sampling error? What are the margins of probable error for the unemployment survey?
The Operational-Empirical Question

The operational-empirical question is framed in measurable quantities in a meaningful design. Should we expect this state of affairs to cause an event like the observed one? Will the mean of the sample be between x and y? Examples include: How unlikely is it to get nine females out of ten calves in an experiment on your farm? Did the price of copper fall between 1800 and the present? These are empirical questions, which have already been transformed by operationalizing from scientific-conceptual questions.

The Statistical Question

The statistical question may be:

1) Estimation of a central value, such as: What is the best guess about the mean of the population in which we are interested?

2) Estimation of dispersion and reliability, such as: How likely is the mean to be between x and y? This sort of question is considered by some (but not by me) to be a question in estimation -- that is, one's best guess about (say) the magnitude and probable error of the mean or median of a population. This is the form of a question about confidence limits -- how likely is the mean to be between x and y?

3) Hypothesis testing, such as: How likely is a given state to produce a state like x? Examples include: What is the probability that a "universe" in which the chance of a female is 100/206 will produce nine females out of ten calves? How likely would be the observed trend in copper prices since 1800 if by chance all those prices had the same chance of being observed in each of those years?

Please notice that the statistical question is framed as a question in probability, not "inverse probability". Indeed, "inverse probability" may well be a vacuous expression.

ILLUSTRATIVE TRANSLATIONS

Let's illustrate the process of translating a scientific question into a statistical question.
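As a first concrete case, the calf question posed earlier (what is the probability that a universe in which the chance of a female is 100/206 will produce nine females out of ten calves?) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: births are treated as independent, "nine out of ten" is read as "nine or more", and the function names, trial count, and random seed are choices made here, not anything given in the text.

```python
import random
from math import comb

P_FEMALE = 100 / 206    # chance of a female calf in the benchmark universe (from the text)
N_CALVES = 10           # calves in the experiment
OBSERVED_FEMALES = 9    # females observed

def exact_tail(n, k, p):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def simulated_tail(n, k, p, trials=20000, seed=0):
    """Estimate the same probability by drawing many samples of n calves.

    The trial count and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated experiment: count the females among n calves.
        females = sum(rng.random() < p for _ in range(n))
        if females >= k:
            hits += 1
    return hits / trials

exact = exact_tail(N_CALVES, OBSERVED_FEMALES, P_FEMALE)
estimate = simulated_tail(N_CALVES, OBSERVED_FEMALES, P_FEMALE)
print(f"exact: {exact:.4f}  simulated: {estimate:.4f}")
```

Both numbers answer the statistical question as framed above: how often would the stated benchmark universe produce a sample like the observed one? Neither number is the probability that the benchmark universe is the true one.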
Illustration A

As of 1964 a study of mine asked: Are doctors' beliefs about the harmfulness of cigarette smoking (and doctors' own smoking behavior) affected by the social groups among whom the doctors live (Simon, 1967-1968)? We decided to define the doctors' reference groups as the states in which they live, because data about doctors and smoking were available state by state (Modern Medicine, 1964). We could then translate this question into an operational and testable scientific hypothesis by asking this question: Do doctors in tobacco-economy states differ from doctors in other states in their smoking, and in their beliefs about smoking?

Which numbers would help us answer this question, and how do we interpret those numbers? We now were ready to ask the statistical question: Do doctors in tobacco-economy states "belong to the same universe" (with respect to smoking) as do other doctors? That is, do doctors in tobacco-economy states have the same characteristics -- at least, those characteristics we are interested in, smoking in this case -- as do other doctors? Later we shall see that the way to proceed is to consider the statistical hypothesis that these doctors do indeed belong to that same universe; that hypothesis and the universe will be called "benchmark hypothesis" and "benchmark universe" respectively -- or in more conventional usage, the "null hypothesis".

If the tobacco-economy doctors do indeed belong to the benchmark universe -- that is, if the benchmark hypothesis is correct -- then there is a 49/50 chance that doctors in some state other than the state in which tobacco is most important will have the highest rate of cigarette smoking. But in fact we observe that the state in which tobacco accounts for the largest proportion of the state's income -- North Carolina -- had (as of 1964) a higher proportion of doctors who smoked than any other state.
(Furthermore, a lower proportion of doctors in North Carolina than in any other state said that they believed that smoking is a health hazard.) Of course, it is possible that it was just chance that North Carolina doctors smoked most, but the chance is only 1 in 50 if the benchmark hypothesis is correct. Obviously, some state had to have the highest rate, and the chance for any other state was also 1 in 50. But, because our original scientific hypothesis was that North Carolina doctors' smoking rate would be highest, and we then observed that it was highest even though the chance was only 1 in 50, the observation became interesting and meaningful to us. It means that the chances are strong -- 49 in 50 -- that there was a connection between the importance of tobacco in the economy of a state and the rate of cigarette smoking among doctors living there (as of 1964). To consider this problem from another direction, it would be rare for North Carolina to have the highest smoking rate for doctors if there were no special reason for it; in fact, it would occur only once in fifty times. But, if there were a special reason -- and we hypothesize that the tobacco economy provides the reason -- then it would not seem unusual or rare for North Carolina to have the highest rate; therefore we choose to believe in the not-so-unusual phenomenon, that the tobacco economy caused doctors to smoke cigarettes. Like many (most? all?) actual situations, the cigarettes and doctors' smoking issue is a rather messy business. Did I have a clear-cut theoretically-derived prediction before I began? Maybe I did a bit of "data dredging" - that is, maybe I started with a vague expectation, and only arrived at my sharp hypothesis after I saw the data. This would weaken the probabilistic interpretation of the test of significance - but this is something that a scientific investigator does not like to do because it weakens his/her claim for attention and chance of publication. 
On the other hand, if one were a Bayesian, one could claim that one had a prior probability that the observed effect would occur, and the observed data strengthen that prior; but this procedure would not seem proper to many other investigators. The only wholly satisfactory conclusion is to obtain more data -- but as of 1993, there does not seem to have been another data set collected since 1964, and collecting a set by myself is not feasible.

This clearly is a case of statistical inference that one could argue about -- but perhaps it is true that all cases where the data are sufficiently ambiguous as to require a test of significance are also sufficiently ambiguous that they are properly subject to argument.

For some decades the hypothetico-deductive framework was the leading point of view in empirical science. It insisted that the empirical and statistical investigation should be preceded by theory, and only propositions suggested by the theory should be tested. Investigators were not supposed to go back and forth from data to theory to testing. It is now clear that this is an ivory-tower irrelevance, and no one lived by the hypothetico-deductive strictures anyway -- just pretended to. Furthermore, there is no sound reason to feel constrained by it, though it strengthens your conclusions if you had theoretical reason in advance to expect the finding you obtained.

Illustration B

Does medicine CCC cure cancer? You begin with this scientific question and give the medicine to six patients who have cancer; you do not give it to six similar patients who have cancer. Your sample is only twelve people because it is simply not feasible for you to obtain a larger one. Five of six "medicine" patients get well, two of six "no medicine" patients get well. Does the medicine cure cancer? That is, if future cancer patients take the medicine, will their rate of recovery be higher than if they did not take the medicine?
One way to translate the scientific question into a statistical question is to ask: Do the "medicine" patients belong to the same universe as the "no medicine" patients? That is, we ask whether "medicine" patients still have the same chances of getting well from the cancer as do the "no medicine" patients, or whether the medicine has bettered the chances of those who took it and thus removed them from the original universe, with its original chances of getting well. The original universe, to which the "no medicine" patients must still belong, is the benchmark universe. Shortly we shall see that we proceed by comparing the observed results against the benchmark hypothesis that the "medicine" patients still belong to the benchmark universe -- that is, they still have the same chance of getting well as the "no medicine" patients. We want to know whether or not the medicine does any good. This question is the same as asking whether patients who take medicine are still in the same population universe as "no medicine" patients, or whether they now belong to a different population in which patients have higher chances of getting well. To recapitulate our translations, we move from asking: Does the medicine cure cancer? to Do "medicine" patients have the same chance of getting well as "no medicine" patients?; and finally to: Do "medicine" patients belong to the same universe (population) as "no medicine" patients? Remember that "population" in this sense does not refer to the population at large, but rather to a group of cancer sufferers (perhaps an infinitely large group) who have given chances of getting well, on the average. Groups with different chances of getting well are called "different populations" (universes). Shortly we shall see how to answer this statistical question. 
We must keep in mind that our ultimate concern in cases like this one is to predict future results of the medicine, that is, to predict whether use of the medicine will lead to a higher recovery rate than would be observed without the medicine.

Illustration C

Is method Alpha a better method of teaching reading than method Beta? That is, will method Alpha produce a higher average reading score in the future than will method Beta? Twenty children taught to read with method Alpha have an average reading score of 79, whereas children taught with method Beta have an average score of 84. To translate this scientific question into a statistical question we ask: Do children taught with method Alpha come from the same universe (population) as children taught with method Beta?

Again, "universe" (population) does not mean the town or social group the children come from, and indeed the experiment will make sense only if the children do come from the same population, in that sense of "population". What we want to know is whether or not the children belong to the same statistical population (universe), defined according to their reading ability, after they have studied with method Alpha or method Beta.

Translating from a scientific question into a statistical question is mostly a matter of asking the probability that some given benchmark universe (population) will produce one or more observed samples. Notice that we must (at least for general scientific testing purposes) ask about a given universe whose composition we assume to be known, rather than about a range of universes, or about a universe whose properties are unknown. In fact, there is really only one question that probability statistics can answer: Given some particular benchmark universe of some stated composition, what is the probability that an observed sample would come from it? (Please notice the subtle but all-important difference between the words "would come" in the previous sentence, and the word "came".)
A variation of this question is: Given two (or more) samples, what is the probability that they would come from the same universe - that is, that the same universe would produce both of them? In this latter case, the relevant benchmark universe is implicitly the universe whose composition is the two samples combined. The necessity for stating the characteristics of the universe in question becomes obvious when you think about it for a moment. Probability-statistical testing adds up to comparing a sample with a particular benchmark universe, and asking whether there probably is a difference between the sample and the universe. To carry out this comparison, we ask how likely it is that the benchmark universe would produce a sample like the observed sample. But in order to find out whether or not a universe could produce a given sample, we must ask whether or not some particular universe -- with stated characteristics -- could produce the sample. There is no doubt that some universe could produce the sample by a random process; in fact, some universe did. The only sensible question, then, is whether or not a particular universe, with stated (or known) characteristics, is likely to produce such a sample. In the case of the medicine, the universe with which we compare the sample who took the medicine is the benchmark universe to which that sample would belong if the medicine had had no effect. This comparison leads to the benchmark (null) hypothesis that the sample comes from a population in which the medicine (or other experimental treatment) seems to have no effect. It is to avoid confusion inherent in the term "null hypothesis" that I replace it with the term "benchmark hypothesis." 
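The two-sample version of the question can be sketched for the medicine data of Illustration B (five of six "medicine" patients recovered, two of six "no medicine" patients recovered). The sketch below is an illustration under assumptions, not the procedure developed later in the book: it takes the benchmark universe to be the two samples combined, as described above, and deals the twelve outcomes at random into two groups of six by shuffling (a permutation-style test, one of several ways such a universe could be sampled). The function names, trial count, and seed are hypothetical choices.

```python
import random
from math import comb

# Outcomes from Illustration B: 1 = got well, 0 = did not get well.
medicine = [1, 1, 1, 1, 1, 0]       # five of six recovered
no_medicine = [1, 1, 0, 0, 0, 0]    # two of six recovered

def shuffle_test(group_a, group_b, trials=20000, seed=0):
    """Estimate how often the combined universe gives group A at least
    as many recoveries as were observed, purely by chance."""
    rng = random.Random(seed)
    pooled = group_a + group_b      # benchmark universe: both samples combined
    observed = sum(group_a)
    n_a = len(group_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)         # deal the outcomes out at random
        if sum(pooled[:n_a]) >= observed:
            hits += 1
    return hits / trials

def exact_tail(group_a, group_b):
    """The same probability, computed by counting the possible deals."""
    total = len(group_a) + len(group_b)
    wells = sum(group_a) + sum(group_b)
    n_a = len(group_a)
    observed = sum(group_a)
    upper = min(n_a, wells)
    ways = sum(comb(wells, k) * comb(total - wells, n_a - k)
               for k in range(observed, upper + 1))
    return ways / comb(total, n_a)

estimate = shuffle_test(medicine, no_medicine)
exact = exact_tail(medicine, no_medicine)
print(f"simulated: {estimate:.3f}  exact: {exact:.3f}")
```

The result is the chance that a single universe with seven recoveries among twelve patients would produce a "medicine" group with five or more recoveries. Whether that chance is small enough to abandon the benchmark hypothesis remains a matter of judgment.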
Illustration D

If one plot of ground is treated with fertilizer, and another similar plot is not treated, the benchmark (null) hypothesis is that the corn raised on the treated plot is no different than the corn raised on the untreated plot -- that is, that the corn from the treated plot comes from ("belongs to") the same universe as the corn from the untreated plot. If our statistical test makes it seem very unlikely that a universe like that from which the untreated-plot corn comes would also produce corn such as came from the treated plot, then we are willing to believe that the fertilizer has an effect. For a psychological example, substitute the words "group of children" for "plot," "special training" for "fertilizer," and "I.Q. score" for "corn."

So far we have discussed the scientific question and the statistical question. There is always a generalization question, too: Do the statistical results from this particular sample of, say, rats apply to a universe of humans? This question can be answered only with wisdom, common sense, and general knowledge, and not with probability statistics.