In ggplot2, what do the ends of the boxplot lines represent?
The "dots" at the end of the boxplot represent outliers. There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot use is the "1.5 rule". If a data point is:
- less than Q1 - 1.5*IQR
- greater than Q3 + 1.5*IQR
then that point is classed as an "outlier". The whiskers are defined as:
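upper whisker = max{x_i : x_i <= Q_3 + 1.5 * IQR}
lower whisker = min{x_i : x_i >= Q_1 - 1.5 * IQR}
where IQR = Q_3 - Q_1, the box length. So the upper whisker is located at the largest data value that is no greater than Q_3 + 1.5 IQR; if no point lies beyond that limit, this is simply the maximum x value. Likewise, the lower whisker is located at the smallest data value that is no less than Q_1 - 1.5 IQR. Note that the whiskers always end at an actual data point, not at the limit itself.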
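Here is a minimal sketch of the rule in base R, checked against boxplot.stats(). It assumes the same skewed sample as the example in this answer; the variable names (upper_fence, lower_whisker and so on) are just mine for illustration. One thing to be aware of: boxplot.stats() takes Q_1 and Q_3 from Tukey's hinges via fivenum(), which can differ slightly from quantile(x, c(0.25, 0.75)), so I use the hinges here too.

```r
set.seed(1)
x <- rlnorm(20, 1/2)  # skewed sample

# Hinges as used by boxplot.stats(); an assumption that matches base R,
# other tools may compute the quartiles differently
five <- fivenum(x)
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

# The 1.5 * IQR limits ("fences")
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
upper_whisker <- max(x[x <= upper_fence])
lower_whisker <- min(x[x >= lower_fence])

# Anything outside the fences is drawn as an individual dot
outliers <- x[x < lower_fence | x > upper_fence]

# Compare with what base R reports
bs <- boxplot.stats(x, coef = 1.5)
print(all.equal(c(lower_whisker, upper_whisker), bs$stats[c(1, 5)]))
print(identical(sort(outliers), sort(bs$out)))
```

Both comparisons should agree, since the sketch follows the same steps that boxplot.stats() documents.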
Additional information
- See the Wikipedia boxplot page for alternative outlier rules.
- There are actually a variety of ways of calculating quantiles. Have a look at `?quantile` for the description of the nine different methods.
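To get a feel for how much the quantile method matters, here is a small sketch that tabulates Q1 and Q3 under each of the nine quantile() types and prints Tukey's hinges from fivenum() alongside (the hinges are what boxplot.stats() uses). The sample and the object names are only for illustration.

```r
set.seed(1)
x <- rlnorm(20, 1/2)  # illustrative skewed sample

# First and third quartiles under each of the nine quantile() definitions
quartiles <- t(sapply(1:9, function(type) {
  quantile(x, probs = c(0.25, 0.75), type = type, names = FALSE)
}))
dimnames(quartiles) <- list(paste0("type", 1:9), c("Q1", "Q3"))
print(round(quartiles, 3))

# Tukey's hinges, for comparison
hinges <- fivenum(x)[c(2, 4)]
print(round(hinges, 3))
```

For small samples the rows usually differ a little, which shifts the box edges and therefore the 1.5 * IQR limits; for large samples the differences tend to become negligible.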
Example
Consider the following example:
> set.seed(1)
> x = rlnorm(20, 1/2)  # skewed data
> par(mfrow=c(1,3))    # three plots side by side
> boxplot(x, range=1.7, main="range=1.7")
> boxplot(x, range=1.5, main="range=1.5")  # default
> boxplot(x, range=0, main="range=0")      # the same as range="Very big number"
This gives the following plot:
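As we decrease range from 1.7 to 1.5 we reduce the maximum length of the whisker, so more points can fall beyond it and be drawn as outliers. However, range=0 is a special case - it's equivalent to "range=infinity": the whiskers extend to the data extremes and no outliers are drawn.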
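The same comparison can be made numerically rather than by eye. This sketch relies on the fact that the range argument of boxplot() corresponds to the coef argument of boxplot.stats(); the names coefs and res are just placeholders of mine.

```r
set.seed(1)
x <- rlnorm(20, 1/2)  # same skewed data as in the plots

coefs <- c(1.7, 1.5, 0)
res <- lapply(coefs, function(k) boxplot.stats(x, coef = k))

# Report whisker ends and the number of points flagged as outliers
for (i in seq_along(coefs)) {
  cat(sprintf("coef = %.1f: whiskers [%.3f, %.3f], %d outlier(s)\n",
              coefs[i], res[[i]]$stats[1], res[[i]]$stats[5],
              length(res[[i]]$out)))
}
```

With coef = 0 the whiskers coincide with range(x) and the outlier vector is empty, which is the "range=infinity" behaviour described above.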
The video "P1IMSA Tutorial 8 - Understanding Box and Whisker Plots" offers a visual step-by-step explanation of (Tukey) box and whisker plots.
At 4m 23s I explain the meaning of the whisker ends and its relationship to the 1.5*IQR.
Although the chart shown in the video was rendered using D3.js rather than R, its explanations jibe with the R implementations of boxplots mentioned.
I think ggplot uses the standard defaults, the same as boxplot: "the whiskers extend to the most extreme data point which is no more than [1.5] times the length of the box away from the box".
See: boxplot.stats
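One way to check this rather than take my word for it: pull the statistics ggplot2 actually computed out of the plot with layer_data() and put them next to boxplot.stats(). This is only a sketch and assumes the ggplot2 package is installed; the data frame d and its column value are made up for the example. As far as I know, ggplot2 computes the box edges with quantile() while boxplot.stats() uses the hinges from fivenum(), so the two can differ slightly on small samples even though the 1.5 * IQR whisker rule is the same.

```r
# Assumes the ggplot2 package is installed
library(ggplot2)

set.seed(1)
d <- data.frame(value = rlnorm(20, 1/2))  # illustrative skewed data

p <- ggplot(d, aes(x = "", y = value)) + geom_boxplot()

# layer_data() returns the statistics computed for the boxplot layer
gg <- layer_data(p, 1)
gg_whiskers <- c(gg$ymin, gg$ymax)
gg_outliers <- gg$outliers[[1]]

# Base R equivalent
base <- boxplot.stats(d$value)

print(gg_whiskers)
print(base$stats[c(1, 5)])
print(sort(gg_outliers))
print(sort(base$out))
```

If the printed whisker ends and outliers match, ggplot2 and boxplot() agree for your data; if they differ a little, the quartile definition is the first thing I would look at.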