Write a program to find 100 largest numbers out of an array of 1 billion numbers
I realized that this is tagged with 'algorithm', but will toss out some other options, since it probably should also be tagged 'interview'.
What is the source of the 1 billion numbers? If it is a database then 'select value from table order by value desc limit 100' would do the job quite nicely - there might be dialect differences.
Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will have you quickly doing productive work that you are getting paid to do while the computer handles this trivial chore.
If it is repeated, you would advise picking any decent approach to get the initial answer and store / cache the results so that you could continuously be able to report the top 100.
Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if that is the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.
Better luck on the next interview.
You can keep a priority queue of the 100 biggest numbers, iterate through the 1 billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.
A priority queue implemented with a heap has insert + delete complexity of O(log K)
. (Where K = 100, the number of elements to find. N = 1 billion, the number of total elements in the array).
In the worst case you get billion*log2(100)
which is better than billion*log2(billion)
for an O(N log N) comparison-based sort1.
In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K)
rather than O(N log N)
, this can be very significant when K is very small comparing to N.
The expected time of this priority queue algorithm is pretty interesting, since in each iteration an insertion may or may not occur.
The probability of the i'th number to be inserted to the queue is the probability of a random variable being larger than at least i-K
random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability.
For example, lets assume the numbers were randomly selected uniformly from {0, 1}
, the expected value of (i-K)th number (out of i numbers) is (i-k)/i
, and chance of a random variable being larger than this value is 1-[(i-k)/i] = k/i
.
Thus, the expected number of insertions is:
And the expected running time can be expressed as:
(k
time to generate the queue with the first k
elements, then n-k
comparisons, and the expected number of insertions as described above, each takes an average log(k)/2
time)
Note that when N
is very large comparing to K
, this expression is a lot closer to n
rather than N log K
. This is somewhat intuitive, as in the case of the question, even after 10,000 iterations (which is very small comparing to a billion), the chance of a number to be inserted to the queue is very small.
But we don't know that the array values are uniformly distributed. They might trend towards increasing, in which case most or all numbers will be be new candidates for the set of 100 largest numbers seen. The worst case for this algorithm is O(N log K)
.
Or if they trend towards decreasing, most of the largest 100 numbers will be very early, and our best-case run time is essentially O(N + K log K)
, which is just O(N)
for K
much smaller than N
.
Footnote 1: O(N) integer sorting / histogramming
Counting Sort or Radix Sort are both O(N), but often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they're actually quite fast, primarily for narrow integer types.
For example, Counting Sort does well if the numbers are small. 16-bit numbers would only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you could just scan the histogram you build as part of Counting Sort.
After histogramming an array, you can quickly answer queries for any order statistic, e.g. the 99 largest numbers, the 200 to 100th largest numbers.) 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters). And on real CPUs, probably getting lots of TLB and cache misses, unlike an array of 2^16 elements where L2 cache would typically hit.
Similarly, Radix Sort could look at only the top buckets after a first pass. But the constant factors may still be larger than log K
, depending on K.
Note that the size of each counter is large enough to not overflow even if all N integers are duplicates. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient. And a 32-bit signed or unsigned integer is just fine.
If you had many more, you might need 64-bit counters, taking twice the memory footprint to initialize to zero and to randomly access. Or a sentinel value for the few counters that overflow a 16 or 32-bit integer, to indicate that the rest of the count is elsewhere (in a small dictionary such as a hash table mapping to 64-bit counters).
You can iterate over the numbers which takes O(n)
Whenever you find a value greater than the current minimum, add the new value to a circular queue with size 100.
The min of that circular queue is your new comparison value. Keep on adding to that queue. If full, extract the minimum from the queue.
If this is asked in an interview, the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.
The description is quite general so maybe you can ask him the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stands for people's age then it's a much easier problem. With a reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people with the same age in just one iteration. Here the index means the age. After this it's a piece of cake to find 100 largest numbers. By the way this algorithm is called counting sort.
Anyway, making the question more specific and clearer is good for you in an interview.