Box Plot
Introduction

A box plot summarizes a set of data measured on an interval scale and is often used in exploratory data analysis. It shows the shape of the distribution, its central value, and its spread. The plot consists of the most extreme values in the data set (the minimum and maximum), the lower and upper quartiles, and the median.

Box plots are particularly useful when large numbers of observations are involved and when two or more data sets are being compared. They help indicate whether a distribution is skewed and whether the data set contains any unusual observations (outliers).

Let us review a few statistical terms before going ahead with box plots.

Median: The median is the value such that as many values in the dataset lie above it as below it. When the dataset is sorted, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Quartiles: Quartiles, by definition, separate a quarter of the data points from the rest. Roughly, the first quartile is the value below which 25% of the data lie, and the third quartile is the value above which 25% of the data lie. (The second quartile is therefore the median itself.)

First Quartile, Q1: Following from the definitions above, the first quartile is the median of the lower half of the data. If the number of data points is odd, the lower half includes the median.

Third Quartile, Q3: The third quartile is the median of the upper half of the data. If the number of data points is odd, the upper half includes the median.

Consider the following dataset:

52, 57, 60, 63, 71, 72, 73, 76, 98, 110, 120

The dataset has 11 values sorted in ascending order. The median is the middle value (i.e. the 6th value in this case).
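The median and quartile definitions above can be sketched in a few lines of Python. This is only an illustration of the "median of each half, including the median when the count is odd" convention described here, not XLMiner's own code; the function name `quartiles` is an assumption made for this example.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    When the number of values is odd, the middle value is included in
    both the lower and the upper half, as described in the text.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half; includes the median if n is odd
    lower = data[:half]          # lower half of the sorted data
    upper = data[n - half:]      # upper half of the sorted data
    return median(lower), median(data), median(upper)


data = [52, 57, 60, 63, 71, 72, 73, 76, 98, 110, 120]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 61.5 72 87.0
```

Other tools may use different quartile conventions (for example, interpolation between neighbouring values), so their results can differ slightly from the values computed this way.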
Median = 72

Q1 is the median of the first 6 values (i.e. the mean of the 3rd and 4th values): Q1 = (60 + 63) / 2 = 61.5

Q3 is the median of the last 6 values (i.e. the mean of the 8th and 9th values): Q3 = (76 + 98) / 2 = 87

The mean is the average of all the data values: Mean = 852 / 11 = 77.45

Interquartile Range = Q3 - Q1 = 87 - 61.5 = 25.5. The interquartile range is a useful measure of the amount of variation in a set of data.

Outliers: These are the extreme values in the dataset.

The Box Plot: After the above preliminaries, let us look at the elements of a typical box plot generated by XLMiner™:
The box extends from Q1 to Q3 and includes Q2 (the median), so the box contains the middle half of the data. The remaining non-extreme points are covered by its "whiskers". The mean value is shown with a plus sign. The distance "d" between the values that the notch specifies is: d = confidence interval around the mean.

The significance of the box plot is that it is not strongly influenced by outliers. XLMiner™ constructs the box plot by extending its "whiskers" only to the most extreme "non-outliers". It uses two explicit rules to define the outliers and marks them with dots. The rules used are:

Cutoff1 = Q1 - 1.5 * (Q3 - Q1)
Cutoff2 = Q3 + 1.5 * (Q3 - Q1)

On the box plot:

Min = the smallest actual value in the dataset that lies above Cutoff1.
Max = the largest actual value in the dataset that lies below Cutoff2.

All values less than Min and all values greater than Max are treated as outliers.
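The cutoff rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not XLMiner's implementation: the function name `box_plot_summary` and the returned dictionary keys are invented for this example, quartiles are taken as the median of each half (with the median included in both halves when the count is odd), and a value lying exactly on a cutoff is treated as a non-outlier.

```python
from statistics import median


def box_plot_summary(values):
    """Compute quartiles, cutoffs, whisker ends and outliers for a box plot."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                  # each half includes the median if n is odd
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    iqr = q3 - q1                        # interquartile range

    cutoff1 = q1 - 1.5 * iqr             # lower outlier cutoff
    cutoff2 = q3 + 1.5 * iqr             # upper outlier cutoff

    # Assumption: values exactly on a cutoff count as non-outliers.
    inside = [x for x in data if cutoff1 <= x <= cutoff2]
    outliers = [x for x in data if x < cutoff1 or x > cutoff2]

    return {
        "q1": q1,
        "q3": q3,
        "cutoff1": cutoff1,
        "cutoff2": cutoff2,
        "min": min(inside),              # end of the lower whisker
        "max": max(inside),              # end of the upper whisker
        "outliers": outliers,            # points drawn as dots
    }


data = [52, 57, 60, 63, 71, 72, 73, 76, 98, 110, 120]
print(box_plot_summary(data))

# Hypothetical extra observation (not part of the example dataset),
# added only to show a value falling outside the upper cutoff.
print(box_plot_summary(data + [200])["outliers"])
```

For the example dataset the cutoffs work out to 23.25 and 125.25, so no value is flagged and the whiskers end at 52 and 120; with the hypothetical value 200 appended, only that value falls outside the cutoffs.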