find median in O(log n)

Okay, with the update to the question so the intent is clear (not just find the median, but re-find the median each time you receive a new number), I think there's a way.

I'd start with a pair of heaps: a max-heap and a min-heap. The min-heap will contain the numbers larger than the median, and the max-heap the numbers smaller than the median. When you receive the first number, that's your median. When you receive the second, you insert the smaller of the two into the max-heap, and the larger of the two into the min-heap. The median is then the average of the smallest on the min-heap, and the largest on the max-heap.

Along with the two heaps, you'll want storage for a single integer that will be the current median when you've received an odd number of inputs. You'll populate that fairly simply: if you receive an input with it currently full, you basically sort those two items (the new number and the old median) and insert the smaller into the heap for the smaller items, and larger into the heap for larger items. Your new median will then be the mean of the bases of those two heaps (and you'll mark the other storage location as empty).

When you receive a new number with that empty, you'll compare the new number to the median. If it's between the numbers as the bases of the heaps, it's the new median, and you're done. Otherwise, extract the number from the base that must hold the median (larger numbers if the new number is larger, smaller if it's smaller) and put that into the median spot, then insert the new number into the heap that came from.

At least if memory serves, the extract/insert into a heap should be O(log N). I believe everything else involved should be constant complexity.


(I assume that you're after an algorithm that, given n existing numbers and one new number, will take logarithmic time to find the median of the new collection of n+1 numbers, so that the total run time for adding n numbers will be O(n lg n).)

There is probably a named algorithm for this already, but here's my idea: Maintain a red-black tree into which you insert the numbers as they arrive. In each node, in addition to the number itself and the child/parent pointers, you store an integer that tells the number of nodes that exist below this node (including the node itself, for convenience). I'm quite certain that this information can be updated in logarithmic time on each insert operation, even when tree rotations are required. With this information embedded in the tree, locating the median can be done in logarithmic time if you also keep track of the number of nodes in the tree.

(This might be a slightly too high-level description; let me know if you need more details.)


Hoare's selection algorithm (aka quickselect) can do this in O(n) average time.

It's basically recursively partitions the data set with a random pivot and checks the appropriate part. There is also a median of medians algorithm which has guaranteed O(n) worst time complexity, but for the normal usage it's usually an overkill.