Find running median from a stream of integers
The most efficient way to calculate a percentile of a stream that I have found is the P² algorithm: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985)
The algorithm is straight forward to implement and works extremely well. It is an estimate, however, so keep that in mind. From the abstract:
A heuristic algorithm is proposed for dynamic calculation qf the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.
If you can't hold all the items in memory at once, this problem becomes much harder. The heap solution requires you to hold all the elements in memory at once. This is not possible in most real world applications of this problem.
Instead, as you see numbers, keep track of the count of the number of times you see each integer. Assuming 4 byte integers, that's 2^32 buckets, or at most 2^33 integers (key and count for each int), which is 2^35 bytes or 32GB. It will likely be much less than this because you don't need to store the key or count for those entries that are 0 (ie. like a defaultdict in python). This takes constant time to insert each new integer.
Then at any point, to find the median, just use the counts to determine which integer is the middle element. This takes constant time (albeit a large constant, but constant nonetheless).
There are a number of different solutions for finding running median from streamed data, I will briefly talk about them at the very end of the answer.
The question is about the details of the a specific solution (max heap/min heap solution), and how heap based solution works is explained below:
For the first two elements add smaller one to the maxHeap on the left, and bigger one to the minHeap on the right. Then process stream data one by one,
Step 1: Add next item to one of the heaps
if next item is smaller than maxHeap root add it to maxHeap,
else add it to minHeap
Step 2: Balance the heaps (after this step heaps will be either balanced or
one of them will contain 1 more item)
if number of elements in one of the heaps is greater than the other by
more than 1, remove the root element from the one containing more elements and
add to the other one
Then at any given time you can calculate median like this:
If the heaps contain equal amount of elements;
median = (root of maxHeap + root of minHeap)/2
Else
median = root of the heap with more elements
Now I will talk about the problem in general as promised in the beginning of the answer. Finding running median from a stream of data is a tough problem, and finding an exact solution with memory constraints efficiently is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type, then we can use counting sort, which can give you a constant memory constant time algorithm. Heap based solution is a more general solution because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can just try to estimate a probability density function for the data and estimate median using that.
If the variance of the input is statistically distributed (e.g. normal, log-normal, etc.) then reservoir sampling is a reasonable way of estimating percentiles/medians from an arbitrarily long stream of numbers.
int n = 0; // Running count of elements observed so far
#define SIZE 10000
int reservoir[SIZE];
while(streamHasData())
{
int x = readNumberFromStream();
if (n < SIZE)
{
reservoir[n++] = x;
}
else
{
int p = random(++n); // Choose a random number 0 >= p < n
if (p < SIZE)
{
reservoir[p] = x;
}
}
}
"reservoir" is then a running, uniform (fair), sample of all input - regardless of size. Finding the median (or any percentile) is then a straight-forward matter of sorting the reservoir and polling the interesting point.
Since the reservoir is fixed size, the sort can be considered to be effectively O(1) - and this method runs with both constant time and memory consumption.