Darwin-2
[rank sum version; see DARWIN-1 for the difference in means version]
Problem
In DARWIN-1, we examined some of Darwin's data for the heights of cross- and of self-fertilized plants grown in 4 different pots to test whether the data supported the assumption that cross-fertilized plants had better growth rates. One of the problems with the dataset was its variability and the relatively small number of plants. In this version, we convert the quantitative data to ranks, a method that diminishes the effect of extreme values. Using this method, can we confirm the hypothesis that seeds from crossed flowers will grow taller than seeds from self-fertilized plants (see Noreen, 1989)? The data below is the same as that in DARWIN-1 except that relative ranks for the height of plants in each pot have been calculated. (Note the handling of ties in Pot 4.)
Darwin-2 Table. Heights of Cross- and Self-Fertilized Plants Studied by Darwin
| Pot | Group | Heights of individual plants (in inches) | Sum of ranks | Diff, rank sums |
|---|---|---|---|---|
| Pot I | Crossed | 23.5, 12.0, 21.0 | | |
| | Rank | 6, 1, 5 | 12 | |
| | Selfed | 17.4, 20.4, 20.0 | | 3 |
| | Rank | 2, 4, 3 | 9 | |
| Pot II | Crossed | 22.0, 19.2, 21.5 | | |
| | Rank | 6, 3, 5 | 14 | |
| | Selfed | 20.0, 18.4, 18.6 | | 7 |
| | Rank | 4, 1, 2 | 7 | |
| Pot III | Crossed | 22.2, 20.4, 18.3, 21.6, 23.2 | | |
| | Rank | 9, 7, 5, 8, 10 | 39 | |
| | Selfed | 18.6, 15.2, 16.5, 18.0, 16.25 | | 23 |
| | Rank | 6, 1, 3, 4, 2 | 16 | |
| Pot IV | Crossed | 21.0, 22.1, 23.0, 12.0 | | |
| | Rank | 6, 7, 8, 1 | 23 | |
| | Selfed | 18.0, 12.8, 15.5, 18.0 | | 9 |
| | Rank | 4.5, 2, 3, 4.5 | 14 | |
Note. Data are from Darwin, 1900, cited in Noreen, 1989.
We measure the difference between crossed and selfed plants by the difference in their sums of ranks. Specifically, rank the heights of all plants in each pot, with the shortest plant ranked 1. Add up the ranks of the crossed plants and, separately, the ranks of the selfed plants, then subtract the selfed sum from the crossed sum; this is the difference for one pot. Finally, add the differences from all four pots. The resulting value, 42, is the observed statistic.
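The ranking and summing just described can be sketched in a few lines of Python (not the Resampling Stats language used later on this page). This is a minimal illustration, not part of the original analysis: the names `POTS`, `midranks`, and `rank_sum_difference` are hypothetical, and tied heights are assumed to share the average of the ranks they occupy, as in Pot IV of the table. One thing to be aware of: recomputing from the tabulated heights gives a Pot IV crossed rank sum of 22 (6 + 7 + 8 + 1) and a Pot IV difference of 8, so this sketch arrives at a total of 41, one less than the 42 quoted in the text.

```python
# Heights (inches) from the Darwin-2 table: (crossed, selfed) for each pot.
POTS = {
    "I":   ([23.5, 12.0, 21.0], [17.4, 20.4, 20.0]),
    "II":  ([22.0, 19.2, 21.5], [20.0, 18.4, 18.6]),
    "III": ([22.2, 20.4, 18.3, 21.6, 23.2], [18.6, 15.2, 16.5, 18.0, 16.25]),
    "IV":  ([21.0, 22.1, 23.0, 12.0], [18.0, 12.8, 15.5, 18.0]),
}


def midranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # lowest rank this value occupies
        count = ordered.count(v)              # number of plants tied at this height
        rank_of[v] = first + (count - 1) / 2  # average of the occupied ranks
    return [rank_of[v] for v in values]


def rank_sum_difference(crossed, selfed):
    """Return (crossed rank sum, selfed rank sum, difference) for one pot."""
    ranks = midranks(crossed + selfed)        # rank all plants in the pot together
    crossed_sum = sum(ranks[:len(crossed)])
    selfed_sum = sum(ranks[len(crossed):])
    return crossed_sum, selfed_sum, crossed_sum - selfed_sum


per_pot = {pot: rank_sum_difference(c, s) for pot, (c, s) in POTS.items()}
total = sum(diff for _, _, diff in per_pot.values())

for pot, (c_sum, s_sum, diff) in per_pot.items():
    print(f"Pot {pot}: crossed {c_sum:g}, selfed {s_sum:g}, difference {diff:g}")
print(f"Total difference in rank sums: {total:g}")
```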
Null hypothesis (H0): Cross- and self-fertilized plants do not really differ in growth, and the observed difference in rank sums occurred by chance. Alternative hypothesis (H1): Cross-fertilized plants grow taller than self-fertilized plants.
Resampling Procedure
We test whether the observed difference in rank sums (42) might occur by chance through the same procedure (permutation within each pot) that we used in DARWIN-1, except replacing the height measurements with their ranks. (See DARWIN-1 for a more detailed explanation.)
1. For each pot, write the numbers 1 through n on separate slips of paper, where n is the number of plants in that pot (6 for pots 1 and 2, 10 for pot 3, and 8 for pot 4).
2. For pot 1, shuffle the 6 slips and take half of them as simulated "crossed" plants and the remaining half as simulated "selfed" plants. Sum the values for "crossed" and for "selfed."
3. Subtract the sum of "selfed" from the sum of "crossed." Note this value, the difference between the rank sums.
4. Repeat steps 2-3 for pots 2, 3, and 4. (Note that you will be using 10 and 8 slips of paper, respectively, for pots 3 and 4.) Add up the differences for the four pots, and record the result on a scoreboard.
5. Repeat steps 2-4 perhaps 10,000 times. Determine how often the resampling result was at least as great as the experimental value of 42. (We look only at positive values, since we are investigating the possibility that crossed plants are taller than selfed plants.)
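For readers without Resampling Stats, the steps above can be sketched in Python. This is an illustrative translation under stated assumptions, not the program used for this page: `POT_SIZES`, `simulated_difference`, and `permutation_p_value` are hypothetical names, the observed value of 42 is taken as quoted in the text, and the fixed seed is there only to make the run repeatable.

```python
import random

# Plants per pot (from the table); half of each pot is labelled "crossed".
POT_SIZES = [6, 6, 10, 8]
OBSERVED = 42        # observed statistic as quoted in the text
TRIALS = 10_000


def simulated_difference(rng):
    """One trial: shuffle the ranks within each pot and sum the four differences."""
    total = 0
    for n in POT_SIZES:
        ranks = list(range(1, n + 1))      # slips of paper numbered 1..n
        rng.shuffle(ranks)
        half = n // 2
        # first half plays "crossed", second half plays "selfed"
        total += sum(ranks[:half]) - sum(ranks[half:])
    return total


def permutation_p_value(observed=OBSERVED, trials=TRIALS, seed=0):
    """Proportion of trials whose summed difference is at least `observed`."""
    rng = random.Random(seed)
    hits = sum(simulated_difference(rng) >= observed for _ in range(trials))
    return hits / trials


p_value = permutation_p_value()
print(p_value)
```

Like the paper procedure, this sketch shuffles the whole numbers 1 through 8 for pot 4 rather than the tied ranks (4.5 and 4.5) that appear in the table, and the result will vary slightly from seed to seed.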
Computer Implementation In Resampling Stats
MAXSIZE scrboard 10000
' the default size of vectors is 1000, so we need to make room in scrboard for 10,000 numbers
DATA 1,6 rank1
' establish vectors holding ranks 1 . . 6 for pots 1 and 2
DATA 1,6 rank2
DATA 1,10 rank3
' pot 3 had 10 plants but otherwise is handled the same way as pots 1 and 2
DATA 1,8 rank4
' pot 4 had 8 plants
REPEAT 10000
' shuffle the values within each pot separately, because the pots hold different numbers of plants,
' then carry out the same computation performed originally with the actual data;
' by definition, the heights of 6 plants are ranked "1" through "6"
SHUFFLE rank1 rank1$
' randomize these values
TAKE rank1$ 1,3 x1$
' the first 3 randomized values are ascribed to "x1$", simulated crosses from pot 1
TAKE rank1$ 4,6 s1$
' the remaining 3 are ascribed to "s1$", simulated selfed from pot 1
SUM x1$ mx1$
' the sum of the simulated crossed group is stored in "mx1$"
SUM s1$ ms1$
' similarly the sum of the simulated selfed group is stored in "ms1$"
SUBTRACT mx1$ ms1$ diff1$
' the difference between these simulated sums is held in "diff1$"
' proceed exactly the same way to simulate pot 2
SHUFFLE rank2 rank2$
TAKE rank2$ 1,3 x2$
TAKE rank2$ 4,6 s2$
SUM x2$ mx2$
SUM s2$ ms2$
SUBTRACT mx2$ ms2$ diff2$
SHUFFLE rank3 rank3$
TAKE rank3$ 1,5 x3$
' note that we use 5 values in each simulated group for pot 3
TAKE rank3$ 6,10 s3$
SUM x3$ mx3$
SUM s3$ ms3$
SUBTRACT mx3$ ms3$ diff3$
SHUFFLE rank4 rank4$
TAKE rank4$ 1,4 x4$
TAKE rank4$ 5,8 s4$
SUM x4$ mx4$
SUM s4$ ms4$
SUBTRACT mx4$ ms4$ diff4$
CONCAT diff1$ diff2$ diff3$ diff4$ alldiff$
' put all four simulated differences into one vector
SUM alldiff$ sumdiff$
' which is summed into "sumdiff$"
SCORE sumdiff$ scrboard
' and the result kept in the scoreboard
END
HISTOGRAM scrboard
COUNT scrboard >= 42 bigger
' in how many runs was the summed difference between ranks at least as great as for the original data?
SIZE scrboard trials
' count the number of repeats by counting the number of values in vector scrboard;
' we could also have divided by 10,000 or whatever number of REPEATs were used
DIVIDE bigger trials prob
' divide by the number of trials to express the results as a proportion
PRINT prob
Results
Frequency histogram of resampled difference in rank sums
prob = 0.0007
Conclusion
Although converting the quantitative height values into ranks threw away information, it also removed the influence of extreme height values. In 10,000 simulated runs, only 7 (p-value 0.0007) resulted in a difference between rank sums that was at least as large as the observed value of 42. Therefore we reject the null hypothesis and conclude that the data support the claim that crossed plants grew taller than selfed plants.
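Because the pots are small, the simulated proportion can be cross-checked by listing every possible split of ranks instead of sampling. The Python sketch below does this by exact enumeration, which is a different technique from the resampling program above and is offered only as a check. It counts, for each pot, how many equal splits of the ranks 1 through n produce each difference, then combines the four pots. The function names are hypothetical, and, as in the simulation, pot 4 is assumed to use the untied ranks 1 through 8.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

POT_SIZES = [6, 6, 10, 8]   # plants per pot, half of them labelled "crossed"


def pot_distribution(n):
    """Count the equal splits of ranks 1..n giving each crossed-minus-selfed difference."""
    ranks = range(1, n + 1)
    total = sum(ranks)
    counts = Counter()
    for crossed in combinations(ranks, n // 2):
        crossed_sum = sum(crossed)
        counts[crossed_sum - (total - crossed_sum)] += 1
    return counts


def combined_distribution(sizes):
    """Combine the per-pot distributions; pots are shuffled independently."""
    combined = Counter({0: 1})
    for n in sizes:
        pot = pot_distribution(n)
        step = Counter()
        for d1, c1 in combined.items():
            for d2, c2 in pot.items():
                step[d1 + d2] += c1 * c2
        combined = step
    return combined


def exact_p_value(observed, sizes=POT_SIZES):
    """Exact proportion of splits whose summed difference is at least `observed`."""
    dist = combined_distribution(sizes)
    n_total = sum(dist.values())
    n_extreme = sum(c for d, c in dist.items() if d >= observed)
    return Fraction(n_extreme, n_total)


print(float(exact_p_value(42)))
```

With whole-number ranks the summed difference is always odd (the rank totals of pots 1 to 3 are odd and that of pot 4 is even), so thresholds of 42 and 43 give the same tail proportion in this enumeration.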
References
Darwin, C. (1900). The effects of cross and self-fertilization in the vegetable kingdom (2nd ed.). London: John Murray.
Noreen, E.W. (1989). Computer intensive methods for testing hypotheses: An introduction. New York: Wiley.