Differing quantiles: Boxplot vs. Violinplot
This is too long for a comment, so I post it as an answer. I see two potential sources for the divergence. First, my understanding is that the boxplot refers to boxplot.stats, which uses hinges that are very similar but not necessarily identical to the quantiles. ?boxplot.stats says:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
The hinge vs. quantile distinction could thus be one source for the difference.
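To make the hinge point concrete, here is a minimal sketch on toy vectors (1:6 and 1:7 are my own illustrative data, not from the question). It compares the hinges reported by boxplot.stats with the default quantile() quartiles for an even and an odd sample size:

```r
# Toy data, chosen only for illustration: one even-length and one odd-length sample.
x_even <- 1:6
x_odd  <- 1:7

# Elements 2 and 4 of $stats are the lower and upper hinge.
hinges_even <- boxplot.stats(x_even)$stats[c(2, 4)]
hinges_odd  <- boxplot.stats(x_odd)$stats[c(2, 4)]

# Default (type 7) sample quartiles.
quart_even <- unname(quantile(x_even, c(0.25, 0.75)))
quart_odd  <- unname(quantile(x_odd,  c(0.25, 0.75)))

print(rbind(hinges_even, quart_even))  # differ for even n
print(rbind(hinges_odd,  quart_odd))   # agree for odd n
```

With real data such as Sepal.Length (n = 50 per species) the gap is usually small, so this alone is unlikely to explain a large visual discrepancy.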
Second, geom_violin refers to a density estimate. The source code here points to a function StatYdensity, which leads me to here. I could not find the function compute_density, but I think (also due to some pointers in help files) it is essentially density, which by default uses a Gaussian kernel to estimate the density. This may (or may not) explain the differences, but
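d <- iris  # assumed: d holds the iris data
by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
# note: density(v)$x is the grid the density is evaluated on, so this is only a rough check
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))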
do indeed show differing values. So I would guess that the difference comes down to whether we look at quantiles based on the empirical distribution function of the observations or on kernel density estimates, though I admit that I have not conclusively shown this.
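For a somewhat closer comparison, one can read quantiles off the estimated density itself by accumulating it into an approximate CDF and inverting that. The helper kde_quantiles below is a hypothetical name of my own, and this is only a sketch of the idea under the assumption that the violin quantiles are derived in a broadly similar way; ggplot2 may also trim the density to the data range, which this sketch does not do.

```r
# Sketch: quartiles implied by a kernel density estimate vs. the raw sample.
# kde_quantiles is a hypothetical helper, not a ggplot2 or base R function.
kde_quantiles <- function(v, probs = c(0.25, 0.5, 0.75)) {
  dens <- density(v)                     # Gaussian kernel, default bandwidth
  cdf  <- cumsum(dens$y) / sum(dens$y)   # approximate CDF on the evaluation grid
  # invert the CDF by linear interpolation
  approx(cdf, dens$x, xout = probs, ties = "ordered")$y
}

x <- iris$Sepal.Length[iris$Species == "setosa"]

sample_q <- unname(quantile(x, c(0.25, 0.5, 0.75)))  # observation-based
kde_q    <- kde_quantiles(x)                         # density-based

print(rbind(sample_q, kde_q))
```

The two rows need not match exactly, which is in line with the guess above.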
The second factor that @coffeinjunky raised seems to be the main cause. Here is some more evidence to bolster that.
By switching to stat_ydensity, one can empirically confirm that the difference is due to geom_violin using the kernel density estimate, rather than the actual observations, to compute the quantiles. For example, if we force a wide bandwidth (bw=1), then the estimated densities will be over-smoothed and deviate further from the observation-based quantiles used in the boxplots:
library(ggplot2)
library(cowplot)
theme_set(cowplot::theme_cowplot())
d <- iris
ggplot(d, aes(factor(0), Sepal.Length)) +
  # violin from a deliberately over-smoothed density (bw = 1); red lines mark its quartiles
  stat_ydensity(bw = 1, fill = "black", alpha = 0.2,
                draw_quantiles = c(0.25, 0.5, 0.75),
                colour = "red", size = 1.5) +
  # boxplot with whisker caps, based on the observations themselves
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(width = 0.2) +
  facet_grid(. ~ Species, scales = "free_x") +
  xlab("") +
  ylab("Value") +
  coord_cartesian(ylim = c(3.5, 9.5)) +
  scale_y_continuous(breaks = seq(4, 9, 1)) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size = rel(1.5)),
        axis.ticks.x = element_blank(),
        strip.background = element_rect(fill = "black"),
        strip.text = element_text(color = "white", face = "bold"),
        legend.position = "none") +
  background_grid(major = "xy", minor = "none")
So, yes, be careful with this one - the parameters of the density estimation can impact the results!
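The bandwidth effect can also be checked numerically without plotting. The sketch below uses base R's density() as a stand-in for whatever the violin computes internally (an assumption on my part), with a hypothetical helper kde_iqr and arbitrarily chosen bandwidths, and reports the interquartile range implied by the density estimate next to the sample IQR:

```r
# Sketch: interquartile range implied by a kernel density estimate, per bandwidth.
# kde_iqr is a hypothetical helper; the bandwidth values are arbitrary.
x <- iris$Sepal.Length[iris$Species == "virginica"]

kde_iqr <- function(v, bw) {
  dens <- density(v, bw = bw)            # Gaussian kernel, fixed bandwidth
  cdf  <- cumsum(dens$y) / sum(dens$y)   # approximate CDF on the grid
  q <- approx(cdf, dens$x, xout = c(0.25, 0.75), ties = "ordered")$y
  q[2] - q[1]
}

bandwidths <- c(0.1, 0.25, 0.5, 1)
kde_iqrs   <- sapply(bandwidths, function(b) kde_iqr(x, b))

print(data.frame(bw = bandwidths, kde_iqr = kde_iqrs, sample_iqr = IQR(x)))
```

The wider the bandwidth, the more the density-based quartiles spread out relative to the observation-based ones, which matches what the plot shows for bw=1.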