Is there a built-in option for detecting outliers in data?
Outliers are commonly flagged using the width of the interquartile range (IQR). The cutoff differs depending on your school of thought, but for roughly normal data about 95% of the values lie within 1.5 IQR above and below the median.
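As a quick sanity check of that 95% figure, here is a minimal sketch that assumes the data are exactly standard normal. The names q1, q3, iqr and coverage are only illustrative and are not used elsewhere in this answer.

```mathematica
(* Quartiles of the standard normal distribution *)
{q1, q3} = Quantile[NormalDistribution[], {1/4, 3/4}];
iqr = q3 - q1;
(* Probability mass within 1.5 IQR of the median, which is 0 here *)
coverage = N[CDF[NormalDistribution[], 1.5 iqr] - CDF[NormalDistribution[], -1.5 iqr]]
(* approximately 0.957 *)
```

For skewed or heavy-tailed data the coverage will differ, so the 95% figure should be read as a rule of thumb rather than a guarantee.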
SeedRandom[90807];
data = Join[RandomVariate[NormalDistribution[], 50],
RandomVariate[ChiSquareDistribution[3], 10]];
We can calculate this range with Quartiles:
#[[2]] + {-1.5, 1.5} ( #[[3]] - #[[1]]) &@Quartiles[data]
(* {-1.97723, 2.30552} *)
We can then use this to Select the outliers from the data:
(* Values lying outside median ± iqrCoeff IQR *)
getOutliers[dat_, iqrCoeff_] :=
  Select[! IntervalMemberQ[
      Interval[#[[2]] + {-1, 1} iqrCoeff ( #[[3]] - #[[1]]) &@
        Quartiles[dat]], #] &]@dat
Then
getOutliers[data, 1.5]
(* {-2.01804, 6.76676, 2.38043, 3.4204, 6.19569, 4.85708, 3.58404, 2.99772} *)
Since you may want a wider or narrower cutoff, BoxWhiskerChart lets you alter the IQR coefficient through its ChartElementFunction option.
BoxWhiskerChart[data, "Outliers",
ChartElementFunction -> ChartElementDataFunction["BoxWhisker", "IQRCoefficient" -> 1]]
And
getOutliers[data, 1]
(* {-1.77297, 1.96271, -1.46257, -1.29773, -2.01804, -1.49219, 6.76676,
   2.38043, 3.4204, 6.19569, 4.85708, 3.58404, 2.99772} *)
You will notice that BoxWhiskerChart takes a little presentation license and does not plot the outliers that would print too close to the whisker.
Hope this helps.
The help for BoxWhiskerChart is not explicit, but it suggests that outliers are defined as points more than 1.5 interquartile ranges above the third quartile or below the first quartile. Far outliers are 3 interquartile ranges outside this region.
I offer the following implementation of this:
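(* Distance outside the quartile box, in units of the IQR; 0 for points inside it *)
outlierdistance[x_List] := Module[{lq, med, uq},
  {lq, med, uq} = Quartiles[x];
  (Ramp[x - uq] + Ramp[lq - x])/(uq - lq)
]
(* Points more than 1.5 IQR beyond the nearest quartile *)
outlier[x_List] := Pick[x, Thread[outlierdistance[x] > 1.5]]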
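If you want the cutoff to be adjustable, for example to pick out far outliers at 3 interquartile ranges, the same quartile-based rule can be sketched with the coefficient as an optional argument. tukeyOutliers is a hypothetical helper name of my own, not a built-in, and the sample list is made up for illustration.

```mathematica
(* Hypothetical helper: points more than k IQR below Q1 or above Q3 *)
tukeyOutliers[x_List, k_ : 1.5] := Module[{q1, q3, iqr},
  {q1, q3} = Quartiles[x][[{1, 3}]];  (* first and third quartiles *)
  iqr = q3 - q1;
  Select[x, # < q1 - k iqr || # > q3 + k iqr &]
]

tukeyOutliers[{1, 2, 3, 4, 5, 6, 7, 8, 100}]
(* {100} *)

(* Far outliers: same rule with k = 3 *)
tukeyOutliers[{1, 2, 3, 4, 5, 6, 7, 8, 100}, 3]
(* {100} *)
```

Note that this measures distance from the quartiles, so for the same coefficient it flags fewer points than a rule centered on the median.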