"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

I use these incremental/recursive mean and median estimators, which both use constant storage:

mean += eta * (sample - mean)
median += eta * sgn(sample - median)

where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
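As a minimal sketch of the two update rules above in Python (the function names, the `start` parameter, and the default `eta` are illustrative assumptions, not part of any library):

```python
def sgn(x):
    """Signum function: returns -1, 0, or 1."""
    return (x > 0) - (x < 0)


def running_mean(samples, eta=0.001, start=0.0):
    """Constant-storage mean estimate with a fixed learning rate eta."""
    mean = start
    for sample in samples:
        mean += eta * (sample - mean)
    return mean


def running_median(samples, eta=0.001, start=0.0):
    """Constant-storage median estimate: step by +/- eta toward each sample."""
    median = start
    for sample in samples:
        median += eta * sgn(sample - median)
    return median


# Hypothetical data: the true median is 3, but the outlier 100 drags the mean up.
data = [1, 2, 3, 4, 100] * 2000
print(running_mean(data, eta=0.01))
print(running_median(data, eta=0.01))
```

With a constant eta the median estimate moves by exactly eta per sample, so eta has to be chosen relative to the scale of the data: too small and the estimate takes a long time to reach the right region, too large and it keeps jittering around the true value.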

This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.

I would love to see an incremental mode estimator of a similar form...

UPDATE

I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function (http://en.wikipedia.org/wiki/Quantile_function) tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
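A small Python sketch of this quantile update, under the same assumptions as before (the function name, `start`, and the default `eta` are made up for illustration):

```python
def sgn(x):
    """Signum function: returns -1, 0, or 1."""
    return (x > 0) - (x < 0)


def running_quantile(samples, p, eta=0.001, start=0.0):
    """Constant-storage estimate of the p-quantile (p=0.5 gives the median)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be within [0, 1]")
    quantile = start
    for sample in samples:
        quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
    return quantile


# Hypothetical data: the integers 0..99 repeated in a scrambled order.
data = [(i * 37) % 100 for i in range(100000)]
print(running_quantile(data, p=0.9, eta=0.01))  # settles near the 90% point
```

The estimate stops drifting where the expected step is zero, i.e. where the fraction of samples below the estimate equals p.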

Skewness and Kurtosis

For on-line algorithms for skewness and kurtosis (along the lines of the on-line variance algorithm), see the parallel algorithms for higher-order statistics on the same Wikipedia page.
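For illustration, here is a sketch of the commonly published one-pass update for the second, third, and fourth central moment sums, from which skewness and excess kurtosis follow. The class and method names are my own, and the results are the population (biased) versions:

```python
import math


class RunningMoments:
    """One-pass mean, variance, skewness, and excess kurtosis."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of (x - mean)^2
        self.m3 = 0.0  # sum of (x - mean)^3
        self.m4 = 0.0  # sum of (x - mean)^4

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # m4 and m3 must be updated before m2, since they use the old values.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def variance(self):
        return self.m2 / self.n

    def skewness(self):
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self):
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0


moments = RunningMoments()
for value in [1.0, 2.0, 3.0, 10.0]:
    moments.update(value)
print(moments.mean, moments.variance(), moments.skewness(), moments.kurtosis())
```

Storage is five numbers regardless of how many samples are seen. The skewness and kurtosis methods divide by m2, so they need at least two distinct values before they are called.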

Median

Median is tough without sorted data. If you know how many data points you have, in theory you only have to sort partially, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.

Median and Mode with Frequency Counts

If the values are integers, I would count frequencies, probably cutting off the highest and lowest values beyond some threshold where I am sure they are no longer relevant. For floats (or too many distinct integers), I would create buckets/intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table.
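A rough sketch of the frequency-count idea in Python. The function names and the `width` parameter are assumptions for illustration; each result is the lower edge of a bucket, so it is only accurate to within one bucket width (with `width=1` and integer data it is exact):

```python
import math
from collections import Counter


def bucket_counts(samples, width=1.0):
    """Count how many samples fall into each bucket [k*width, (k+1)*width)."""
    counts = Counter()
    for x in samples:
        counts[math.floor(x / width)] += 1
    return counts


def approx_mode(counts, width=1.0):
    """Lower edge of the most frequent bucket."""
    bucket = max(counts, key=counts.get)
    return bucket * width


def approx_median(counts, width=1.0):
    """Lower edge of the bucket that contains the (lower) median."""
    target = (sum(counts.values()) + 1) // 2  # rank of the lower median
    running = 0
    for bucket in sorted(counts):
        running += counts[bucket]
        if running >= target:
            return bucket * width
    raise ValueError("no samples counted")


counts = bucket_counts([1, 2, 2, 3, 3, 3, 4, 10])
print(approx_mode(counts), approx_median(counts))
```

Memory use depends on the number of distinct buckets rather than the number of samples, which is why cutting off irrelevant extremes or widening the buckets keeps the table small.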

Normally Distributed Random Variables

If the data is normally distributed, I would use the sample mean, variance, skewness, and kurtosis of a small subset as maximum likelihood estimators. You already know the (on-line) algorithms to calculate those. E.g. read in a couple of hundred thousand or a million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100'000 values). The same approach can also be used for estimating the mode and median in the normal case (the sample mean is an estimator for both).
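One way to pick randomly in a single pass, without knowing the total count in advance, is reservoir sampling. This is a technique I am adding here as an illustration, not something the text above prescribes. A minimal sketch of it (Algorithm R), with hypothetical function and parameter names:

```python
import random
import statistics


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)       # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x      # replace with probability k / (i + 1)
    return sample


# Hypothetical usage: estimate the mean from a random subset of a large stream.
subset = reservoir_sample(range(1000000), 1000, random.Random(42))
print(statistics.mean(subset))
```

If the stream has fewer than k items, the function simply returns all of them.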

Further comments

All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps.

I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.

In general, sampling the data (i.e. only looking at a subset) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distribution) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) of the Cauchy distribution do not exist. In this case, the sample mean of a "small" subset might be massively off from the sample mean of the whole sample.

I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.