How to compute 2⁶⁴/n in C?

I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.

phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 264-n, then (-n)/n = 264/n - 1, and we can simply add back the 1.

uint64_t divide_two_to_the_64(uint64_t n) {
  return (-n)/n + 1;
}

The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    div     rdi
    add     rax, 1
    ret

I've come up with another solution which was inspired by this question. From there we know that

(a1 + a2 + a3 + ... + an)/n =

(a1/n + a2/n + a3/n + ... + an/n) + (a1 % n + a2 % n + a3 % n + ... + an % n)/n

By choosing a1 = a2 = a3 = ... = an-1 = 1 and an = 264 - n we'll have

(a1 + a2 + a3 + ... + an)/n = (1 + 1 + 1 + ... + (264 - n))/n = 264/n

= [(n - 1)*1/n + (264 - n)/n] + [(n - 1)*0 + (264 - n) % n]/n

= (264 - n)/n + ((264 - n) % n)/n

264 - n is the 2's complement of n, which is -n, or we can also write it as ~0 - n + 1. So the final solution would be

uint64_t twoPow64div(uint64_t n)
{
    return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}

The last part is to correct the result, because we deal with unsigned integers instead of signed ones like in the other question. Checked both 32 and 64-bit versions on my PC and the result matches with your solution

On MSVC however there's an intrinsic for 128-bit division, so you can use like this

uint64_t remainder;
return _udiv128(1, 0, n, &remainder);

which results in the cleanest output

    mov     edx, 1
    xor     eax, eax
    div     rcx
    ret     0

Here's the demo

On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of precision, so you can use either of these

(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)

although probably the performance would be worse. This can also be applied to any implementations where long double has more than 63 bits of significand, like PowerPC with its double-double implementation

There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1: Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow with also clever solutions. Based on that we have

264/n = (264 - n + n)/n = (264 - n)/n + 1 = (-n)/n + 1

which is essentially just another way to get Nate Eldredge's answer

Here's some demo for other compilers on godbolt

See also:

  • Trick to divide a constant (power of two) by an integer
  • Efficient computation of 2**64 / divisor via fast floating-point reciprocal

We use a 64-bit CPU

Which 64-bit CPU?

In general, if you multiply a number with N bits by another number that has M bits, the result will have up to N+M bits. For integer division it's similar - if a number with N bits is divided by a number with M bits the result will have N-M+1 bits.

Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has less digits); some CPUs support "widening multiplication" and "narrowing division".

In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.

Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".

Ironically (for unsigned 64-bit divisors on 64-bit 80x86) there is no other choice and the compiler must use the DIV instruction that will divide a 128-bit number by a 64-bit number. This means that the C language forces you to use a 64-bit numerator, then the code generated by the compiler extends your 64 bit numerator to 128 bits and divides it by a 64 bit number to get a 64 bit result; and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.

Hopefully you can see how this situation might be considered "less than ideal".

What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:

  __uint128_t numerator = (__uint128_t)1 << 64;
  if(n > 1) {
      return (uint64_t)(numerator/n);
  }

I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed (they all generated code that does a call __udivti3, which is an expensive function to get a 128 bit result). The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zeros).

In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.

For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    div     rdi
    add     rax, 1
    ret

...and the best code that's possible is:

    mov     edx, 1
    xor     rax, rax
    div     rdi
    ret