How can I speed up the binary GCD algorithm using __builtin_ctz?
Here's my iterative implementation from the comments:
While tail-recursive algorithms are often elegant, iterative implementations are almost always faster in practice. (Modern compilers can actually perform this transform in very simple cases.)
unsigned ugcd (unsigned u, unsigned v)
{
unsigned t = u | v;
if (u == 0 || v == 0)
return t; /* return (v) or (u), resp. */
int g = __builtin_ctz(t);
while (u != 0)
{
u >>= __builtin_ctz(u);
v >>= __builtin_ctz(v);
if (u >= v)
u = (u - v) / 2;
else
v = (v - u) / 2;
}
return (v << g); /* scale by common factor. */
}
As mentioned, the |u - v| / 2
step is typically implemented as a very efficient, unconditional right shift, e.g., shr r32
, to divide by (2)
- as both (u)
, (v)
are odd, and therefore |u - v|
must be even.
It's not strictly necessary, as the 'oddifying' step: u >>= __builtin_clz(u);
will effectively perform this operation in the next iteration.
Supposing that (u)
or (v)
have a 'random' bit distribution, the probability of (n)
trailing zeroes, via tzcnt
, is ~ (1/(2^n))
. This instruction is an improvement over bsf
, the implementation for __builtin_clz
prior to Haswell, IIRC.
Thanks to helpful commentators, I have found the crucial mistake: I should have used min
instead of max
This is the final solution:
#include <algorithm>
constexpr unsigned gcd(unsigned u, unsigned v)
{
if (u == v || u == 0 || v == 0)
return u | v;
// effectively compute min(ctz(u), ctz(v))
unsigned shift = __builtin_ctz(u | v);
u >>= __builtin_ctz(u);
v >>= __builtin_ctz(v);
const auto &[min, max] = std::minmax(u, v);
return gcd(max - min, min) << shift;
}
int main() {
constexpr unsigned g = gcd(25, 15); // g = 5
return g;
}
This solution also has very nice, nearly branch-free compile output.
Here are some benchmark results of all the answers so far (we actually beat std::gcd
):