Optimising multiplication modulo a small prime

This doesn't answer the question directly, but I would recommend not doing this in pure Python if you're looking for performance. Some options:

  • Make a small library in C that does your computations, and use Python's ctypes to talk to it.
  • Use numpy; probably the best option if you want to avoid compiling anything yourself. A single operation won't be faster than Python's own operators, but if you can put many operands into a numpy array, the computation on the whole array will be much faster than the equivalent Python loop. Use a 64-bit integer dtype, since the product of two values near p needs about 60 bits.
  • Use Cython to declare your variables as C integers; as with numpy, you will benefit the most if you work in batches (because then you can also optimize the loop).
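
To illustrate the numpy option, here is a minimal sketch of a batched modular multiplication. It assumes the inputs are already reduced to the range [0, p), and the function name `mulmod_batch` is made up for this example.

```python
import numpy as np

P = 1_000_000_007  # the modulus from the question; fits in 30 bits

def mulmod_batch(a, b, p=P):
    """Element-wise (a * b) % p for arrays of residues in [0, p)."""
    a = np.asarray(a, dtype=np.uint64)
    b = np.asarray(b, dtype=np.uint64)
    # Assumption: a, b < p < 2**30, so every product is below 2**60
    # and cannot wrap around in a 64-bit unsigned integer.
    return (a * b) % np.uint64(p)

a = [123_456_789, 999_999_999, 1_000_000_006]
b = [987_654_321, 999_999_998, 1_000_000_006]
result = mulmod_batch(a, b)
```

Whether this beats a plain Python loop depends on the array size, so it is worth timing on realistic batch sizes.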

To do this calculation in assembly but have it callable from Python, I'd try inline assembly in a Python extension module written in C. Both GCC and MSVC support inline assembly, though with different syntax (and MSVC only when targeting 32-bit x86).

Note that our modulus p = 1000000007 just fits into 30 bits. The desired result (a*b)%p can be computed in Intel 80x86 registers, given some weak restrictions that keep a and b from being much bigger than p.

Restrictions on size of a,b

(1) a,b are 32-bit unsigned integers

(2) a*b is less than p << 32, i.e. p times 2^32

In particular, if a and b are each less than 2*p, overflow is avoided, because then a*b < 4*p^2 and 4*p < 2^32. Given (1), it also suffices that either one of them is less than p.
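
These bounds are easy to check with Python's arbitrary-precision integers. The snippet below is only a sanity check of the two worst cases; the variable names are my own.

```python
P = 1_000_000_007
LIMIT = P << 32  # condition (2): a*b must stay below p * 2**32

# Case A: both operands are below 2*p, so the largest product is (2p - 1)**2.
worst_a = (2 * P - 1) ** 2
# Case B: one operand is below p, the other is any 32-bit unsigned value.
worst_b = (P - 1) * (2**32 - 1)

fits_30_bits = P < 2**30
case_a_ok = worst_a < LIMIT
case_b_ok = worst_b < LIMIT
print(fits_30_bits, case_a_ok, case_b_ok)  # prints: True True True
```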

The Intel 80x86 instruction MUL can multiply two 32-bit unsigned integers and store the 64-bit result in accumulator register pair EDX:EAX. Some details and quirks of MUL are discussed in Section 10.2.1 of this helpful summary.

The instruction DIV can then divide this 64-bit result by a 32-bit constant (the modulus p), storing the quotient in EAX and the remainder in EDX. See Section 10.2.2 of the last link. The result we want is that remainder.

It is this division instruction DIV that entails a risk of overflow: if the 64-bit dividend in EDX:EAX fails to satisfy (2) above, the quotient will not fit in 32 bits.
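
Until there is real assembly to show, the MUL/DIV sequence can be modelled in Python to see where the overflow comes from. This is only a behavioural sketch, not the C/inline assembly itself; `mul32`, `div64_by_32` and `mulmod` are names invented here, and the Python exception stands in for the divide error the CPU would raise.

```python
MASK32 = 0xFFFFFFFF
P = 1_000_000_007

def mul32(a, b):
    """Model of MUL: 32-bit x 32-bit unsigned -> (EDX, EAX) halves of the 64-bit product."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32  # restriction (1)
    product = a * b
    return product >> 32, product & MASK32

def div64_by_32(edx, eax, divisor):
    """Model of DIV: EDX:EAX / divisor -> (quotient for EAX, remainder for EDX)."""
    dividend = (edx << 32) | eax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK32:
        # Restriction (2) was violated; real hardware raises a divide error here.
        raise OverflowError("quotient does not fit in 32 bits")
    return quotient, remainder

def mulmod(a, b, p=P):
    """(a * b) % p via the modelled MUL followed by DIV."""
    edx, eax = mul32(a, b)
    _, remainder = div64_by_32(edx, eax, p)
    return remainder
```

For example, `mulmod(P - 1, P - 1)` stays within the restrictions, while `mulmod(0xFFFFFFFF, 0xFFFFFFFF)` raises the `OverflowError` because the product exceeds p << 32.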

I'm working on a code snippet in C/inline assembly for "proof of concept". However the maximum benefit in speed will depend on batching up arrays of data a,b to process, amortizing the overhead of function calls, etc. in Python (if that is the target platform).


You mention that "a, b are of the same order of magnitude as p." Often in cryptography this means that a and b are large numbers near p, but strictly less than p.

If this is the case, then you could use the simple identity

a-p \equiv a \pmod{p}

to turn your calculation into

result = ((a-p)*(b-p))%p

You've then turned one large multiplication into two large subtractions and a small multiplication (small only when a and b really are close to p). You'll have to profile to see which is faster.
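
A quick way to convince yourself that the identity holds is to compare both forms in Python. The helper name `mulmod_shifted` and the sample values are just for illustration.

```python
P = 1_000_000_007

def mulmod_shifted(a, b, p=P):
    """(a * b) % p computed from the representatives a - p and b - p."""
    # For a, b just below p both factors are small negative numbers,
    # so their product is a small positive number.
    return ((a - p) * (b - p)) % p

a, b = P - 3, P - 5
direct = (a * b) % P             # multiplies two ~30-bit numbers
shifted = mulmod_shifted(a, b)   # multiplies -3 by -5
print(direct, shifted)  # prints: 15 15
```

The two results agree for any integers a and b; the subtraction only pays off when a - p and b - p are small in absolute value.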