Calculating mid in binary search
The following C++ program can show you how an overflow can happen with a 32-bit unsigned integer:
#include <iostream>
using namespace std;
int main ()
{
unsigned int low = 33,
high = 4294967290,
mid;
cout << "The value of low is " << low << endl;
cout << "The value of high is " << high << endl;
mid = (low + high) / 2;
cout << "The value of mid is " << mid << endl;
return 0;
}
If you run it on a Mac:
$ g++ try.cpp && ./a.out
The value of low is 33
The value of high is 4294967290
The value of mid is 13
The value of mid
might be expected to be 2147483661
, but low + high
overflowed because a 32-bit unsigned integer cannot contain the proper value, and give back 27
, and so mid
becomes 13
.
When the calculation of mid
is changed to
mid = low + (high - low) / 2;
Then it will show
The value of mid is 2147483661
The simple answer is, the addition l + u
can overflow, and has undefined behavior in some languages, as described in a blog post by Joshua Bloch, about a bug in the Java library for the implementation of binary search.
Some readers may not understand what it is about:
l + (u - l) / 2
Note that in some code, the variable names are different, and it is
low + (high - low) / 2
The answer is: let's say if you have two numbers: 200 and 210, and now you want the "middle number". And let's say if you add any two numbers and the result is greater than 255, then it can overflow and the behavior is undefined, then what can you do? A simple way is just to add the difference between them, but just half of it, to the smaller value: look at what the difference is between 200 and 210. It is 10. (You can consider it the "difference" or "length", between them). So you just need to add 10 / 2 = 5
to 200, and get 205. You don't need to add 200 and 210 together first -- and that's how we can reach the calculation: (u - l)
is the difference. (u - l) / 2
is half of it. Add that to l
and we have l + (u - l) / 2
.
It is like, if we are looking at two trees, one is 200 feet tall and one is 210 feet tall, what is the "midpoint" or the "mean"? We don't have to add them together first. We can just tell the difference is 10 feet, and we can add half of that, which is 5, to 200, and we know it is 205 feet.
To put this into history perspectives, Robert Sedgewick mentioned that the first binary search was stated in 1946, and it wasn't correct until 1964. Jon Bentley described in his book Programming Pearls in 1988 that more that 90% of the professional programmers could not write it correctly given a couple of hours. But even Jon Bentley himself had that overflow bug for 20 years. A study that was published in 1988 showed that accurate code for binary search was only found in 5 out of 20 textbooks. In 2006, Joshua Bloch wrote that blog post about the bug about calculating the mid
value. So it took 60 years for this code to be correct. But now, next time in the job interview, remember to write it correctly within that 5 minutes.
This post covers this famous bug in a lot of detail. As others have said it's an overflow issue. The fix recommended on the link is as follows:
int mid = low + ((high - low) / 2);
// Alternatively
int mid = (low + high) >>> 1;
It is also probably worth mentioning that in case negative indices are allowed, or perhaps it's not even an array that's being searched (for example, searching for a value in some integer range satisfying some condition), the code above may not be correct as well. In this case, something as ugly as
(low < 0 && high > 0) ? (low + high) / 2 : low + (high - low) / 2
may be necessary. One good example is searching for the median in an unsorted array without modifying it or using additional space by simply performing a binary search on the whole Integer.MIN_VALUE
–Integer.MAX_VALUE
range.
The problem is that (l+u)
is evaluated first, and could overflow int, so (l+u)/2
would return the wrong value.
Jeff suggested really good post to read about this bug, here is summary if you want quick overview.
In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.