Modular arithmetic - efficiently calculating the remainders of factorials

This is bit faster:

toPrime = 500;
sums = Accumulate@FoldList[Times, 1, Range[2, Prime@toPrime - 1]];
primes = Prime[Range[toPrime]]; 
Mod[sums[[primes - 1]], primes]

Precompute factorial sums and primes. Mod is fast on lists.


Let $x \equiv r_1 \bmod p$ and $y \equiv r_2 \bmod p$. Then, $x y \equiv r_1 r_2 \bmod p$. So, we can compute the sum of the factorials mod p using:

f[p_] := Mod[ Total @ FoldList[ Mod[Times[##], p]&, Range[p-1]], p]

Let's compare this to the naive implementation:

t[p_] := Mod[Sum[k!, {k, p-1}], p]

For example:

f[Prime[500]] //AbsoluteTiming
t[Prime[500]] //AbsoluteTiming

{0.00058, 1813}

{0.085628, 1813}

The nice thing about using Mod[Times[##], p] as the FoldList function is that the output should be a machine number unless you are working with very large primes. That means that f can be compiled:

fc = Compile[{{p, _Integer}},
    Mod[ Total @ FoldList[ Mod[Times[##], p]&, Range[p-1]], p],
    RuntimeAttributes->{Listable}
];

Let's compare for a large prime:

f[Prime[10^6]] //AbsoluteTiming
fc[Prime[10^6]] //AbsoluteTiming

{1.83041, 9308538}

{1.05449, 9308538}

Faster, but the real difference is that fc is Listable. Hence, comparing timings on a list shows a much larger difference. For example:

r1 = f /@ Prime[Range[5000, 5500]]; //AbsoluteTiming
r2 = fc[Prime[Range[5000, 5500]]]; //AbsoluteTiming

r1 === r2

{2.06691, Null}

{0.395402, Null}

True

Finally, since the list of all factorials is not stored anywhere, the memory used is much more manageable. Here is the memory and timing for the first 5000 primes:

r1 = fc[Prime[Range[5000]]]; //MaxMemoryUsed //AbsoluteTiming

{3.03463, 462848}

This is quite a bit smaller than @MichaelE2's answer:

MaxMemoryUsed[
    toPrime = 5000;
    sums = Accumulate@FoldList[Times, 1, Range[2, Prime@toPrime - 1]];
    primes = Prime[Range[toPrime]]; 
    r2 = Mod[sums[[primes - 1]], primes]
] //AbsoluteTiming

r1 === r2

{2.40441, 4064607008}

True

The compiled answer uses .462KB while the approach where all the factorials are precomputed takes 4GB, and the timing is not too different.