Box Plot
Introduction

A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph that shows the shape of the distribution, its central value, and its spread. The picture consists of the most extreme values in the data set (the maximum and minimum values), the lower and upper quartiles, and the median. Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared. They help indicate whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

Let us review a few statistical terms before going ahead with box plots.

Median: The median of a dataset is the value such that as many values lie above it as below it. When the dataset is sorted, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Quartiles: Quartiles, by definition, separate a quarter of the data points from the rest. Roughly, the first quartile is the value under which 25% of the data lie, and the third quartile is the value over which 25% of the data lie. (This means that the second quartile is the median itself.)

First Quartile, Q1: Following from the definitions above, the first quartile is the median of the lower half of the data. If the number of data points is odd, the lower half includes the median.

Third Quartile, Q3: The third quartile is the median of the upper half of the data. If the number of data points is odd, the upper half includes the median.

See the following example. Consider the following dataset:

52, 57, 60, 63, 71, 72, 73, 76, 98, 110, 120

The dataset has 11 values sorted in ascending order. The median is the middle value (i.e., the 6th value in this case).
Median = 72

Q1 is the median of the first 6 values (i.e., the mean of the 3rd and 4th values). Likewise, Q3 is the median of the last 6 values (i.e., the mean of the 8th and 9th values).

Q1 = 61.5
Q3 = 87

The mean is the average of all the data values.

Mean = 77.45

Interquartile Range = Q3 - Q1. The interquartile range is a useful measure of the amount of variation in a set of data. For this dataset it is 87 - 61.5 = 25.5.

Outliers: These are the extreme values in the dataset, lying unusually far from the bulk of the data.

The Box Plot: After the above preliminaries, let us take a look at the typical box plot XLMiner™ can generate:
The box extends from Q1 to Q3 and includes Q2 (the median). This means the box contains the middle one-half of the data. The other non-extreme points are covered by its "whiskers". The mean value is shown with a plus sign. The distance "d" between the values that the notch specifies is:

d = Confidence interval around the mean.

A key strength of the box plot is that it is not strongly influenced by outliers. XLMiner™ constructs the box plot by extending its "whiskers" only to the most extreme "non-outliers". It uses two explicit rules to define the outliers and marks them with dots. The rules used are:

Cutoff1 = Q1 - 1.5 * (Q3 - Q1)
Cutoff2 = Q3 + 1.5 * (Q3 - Q1)

On the box plot:

Min = The smallest actual value in the dataset that is still above Cutoff1.
Max = The largest actual value in the dataset that is still below Cutoff2.

All values less than Min and all values greater than Max are treated as outliers. For the example dataset, Cutoff1 = 61.5 - 1.5 * 25.5 = 23.25 and Cutoff2 = 87 + 1.5 * 25.5 = 125.25, so Min = 52, Max = 120, and there are no outliers.
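
The quartile definitions and the two cutoff rules above can be checked with a few lines of Python. This is only a minimal sketch of the rules as described on this page, not XLMiner's actual implementation; the function names quartiles_inclusive and box_plot_summary are hypothetical, and treating a value exactly equal to a cutoff as a non-outlier is an assumption, since the text does not cover that case.

```python
from statistics import mean, median


def quartiles_inclusive(values):
    """Return (Q1, median, Q3) using the rule described on this page:
    each quartile is the median of one half of the sorted data, and
    when the number of points is odd both halves include the median."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # half size; counts the median when n is odd
    lower = data[:half]          # lower half of the sorted data
    upper = data[n - half:]      # upper half of the sorted data
    return median(lower), median(data), median(upper)


def box_plot_summary(values):
    """Compute the numbers a box plot is drawn from (hypothetical helper)."""
    data = sorted(values)
    q1, med, q3 = quartiles_inclusive(data)
    iqr = q3 - q1
    cutoff1 = q1 - 1.5 * iqr
    cutoff2 = q3 + 1.5 * iqr
    # Assumption: values exactly on a cutoff count as non-outliers.
    inside = [x for x in data if cutoff1 <= x <= cutoff2]
    outliers = [x for x in data if x < cutoff1 or x > cutoff2]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "mean": mean(data),
        "iqr": iqr,
        "cutoff1": cutoff1,
        "cutoff2": cutoff2,
        "min": inside[0],        # lower whisker end: smallest non-outlier
        "max": inside[-1],       # upper whisker end: largest non-outlier
        "outliers": outliers,
    }


# The example dataset from this page.
example = [52, 57, 60, 63, 71, 72, 73, 76, 98, 110, 120]
summary = box_plot_summary(example)
print(summary["q1"], summary["median"], summary["q3"])  # 61.5 72 87.0
print(summary["cutoff1"], summary["cutoff2"])           # 23.25 125.25
print(summary["outliers"])                              # []
```

Appending a far larger value such as 200 to the example moves Cutoff2 to 167.75, so 200 is reported as an outlier while the upper whisker still ends at 120. Other software may use a different quartile convention and therefore give slightly different boxes for the same data.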