Why is this prime test so slow?

short answer: the counting sieve is slower than Turner's (a.k.a. "naive") sieve because it emulates direct RAM access with sequential counting, which forces it to pass along the streams unsieved between the marking stages. This is ironic because counting makes it a "genuine" sieve of Eratosthenes, as opposed to the Turner's trial division sieve. Actually removing the multiples, like the Turner's sieve is doing, would mess up the counting.

Both algorithms are extremely slow because they start up multiples elimination work too early from each found prime instead of its square, thus creating too many unneeded stream processing stages (whether filtering or marking) - O(n) of them, instead of just ~ 2*sqrt n/log n, in producing primes up to n in value. No checking for the multiples of 7 is required until 49 is seen in the input.

This answer explains how sieve can be seen as building a pipeline of stream-processing "transducers" behind itself, as it is working:

[2..] ==> sieve --> 2
[3..] ==> mark 1 2 ==> sieve --> 3
[4..] ==> mark 2 2 ==> mark 1 3 ==> sieve 
[5..] ==> mark 1 2 ==> mark 2 3 ==> sieve --> 5
[6..] ==> mark 2 2 ==> mark 3 3 ==> mark 1 5 ==> sieve 
[7..] ==> mark 1 2 ==> mark 1 3 ==> mark 2 5 ==> sieve --> 7
[8..] ==> mark 2 2 ==> mark 2 3 ==> mark 3 5 ==> mark 1 7 ==> sieve
[9..] ==> mark 1 2 ==> mark 3 3 ==> mark 4 5 ==> mark 2 7 ==> sieve
[10..]==> mark 2 2 ==> mark 1 3 ==> mark 5 5 ==> mark 3 7 ==> sieve
[11..]==> mark 1 2 ==> mark 2 3 ==> mark 1 5 ==> mark 4 7 ==> sieve --> 11

The Turner sieve uses nomult p = filter ((/=0).(`rem`p)) in place of mark _ p entries but otherwise looks the same:

[2..] ==> sieveT --> 2
[3..] ==> nomult 2 ==> sieveT --> 3
[4..] ==> nomult 2 ==> nomult 3 ==> sieveT 
[5..] ==> nomult 2 ==> nomult 3 ==> sieveT --> 5
[6..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT 
[7..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT --> 7
[8..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> nomult 7 ==> sieveT

Each such transducer can be implemented as a closure frame (a.k.a. "thunk"), or a generator with mutable state, that is unimportant. The output of each such producer goes directly as input into its successor in a chain. There are no unevaluated thunks here, each is forced by its consumer, to produce its next output.

So, to answer your question,

I suspect it has something to do with iterating each element in the list in the mark function.

yes, exactly. They both run non-postponed schemes otherwise.

The code can be thus improved by postponing the start of stream-marking:

primes = 2:3:filter (>0) (sieve [5,7..] (tail primes) 9)

sieve (x:xs) ps@ ~(p:t) q 
   | x < q = x:sieve xs ps q
   | x==q  = sieve (mark xs 1 p) t (head t^2)
  where
    mark (y:ys) k p 
       | k == p    = 0 : (mark ys 1 p)      -- mark each p-th number in supply
       | otherwise = y : (mark ys (k+1) p)

It now runs at just above O(k^1.5), empirically, in k primes produced. But why count by ones when we can count by an increment. (Each 3rd odd number from 9 can be found by adding 6, again and again.) And then instead of marking, we can weed out the numbers right away, getting ourselves a bona fide sieve of Eratosthenes (even if not the most efficient one):

primes = 2:3:sieve [5,7..] (tail primes) 9

sieve (x:xs) ps@ ~(p:t) q 
   | x < q = x:sieve xs ps q
   | x==q  = sieve (weedOut xs (q+2*p) (2*p)) t (head t^2)
  where
    weedOut i@(y:ys) m s 
       | y < m = y:weedOut ys m s
       | y==m  = weedOut ys (m+s) s
       | y > m = weedOut i (m+s) s

This runs at above O(k^1.2) in k primes produced, quick-n-dirty testing compiled-loaded into GHCi, producing upto 100k - 150k primes, deteriorating to O(k^1.3) at about 0.5 mil primes.

So what kind of speedups are achieved by this? Comparing the OP code with the "Wikipedia"'s Turner sieve,

primes = sieve [2..] :: [Int]
  where
    sieve (x:xs) = x : sieve [y | y <- xs, rem y x /= 0]

there was an 8x speedup of W/OP at 2k (i.e. producing 2000 primes). But at 4k it was a 15x speedup. The Turner sieve seems to run at about O(k^1.9 .. 2.3) empirical complexity in producing k = 1000 .. 6000 primes, and the counting sieve at O(k^2.3 .. 2.6) for the same range.

For the two versions here in this answer, v1/W was an additional 20x faster at 4k, and 43x at 8k. v2/v1 was 5.2x at 20k, 5.8x at 40k and 6.5x faster at producing 80,000 primes.

(for comparison, the priority-queue code by Melissa O'Neill runs at about O(k^1.2) empirical complexity, in k primes produced. It scales much better than the code here, of course).

Here is the sieve of Eratosthenes's definition:

P = {3,5, ...} \ &bigcup; { {p*p, p*p + 2*p, ...} | p in P }

The key to Sieve of Eratosthenes's efficiency is direct generation of multiples of primes, by counting in increments of (twice) prime's value from each prime; and their direct elimination, made possible by conflation of value and address much like in integer sorting algorithms (possible only with mutable arrays). It is immaterial whether it must produce pre-set number of primes or work indefinitely because it can always work by segments.

It's not only the thunks that make it horrendously slow, that algorithm would also be very slow if implemented in C on a finite bit array.

sieve :: [Integer] -> [Integer]
sieve (0 : xs) = sieve xs
sieve (n : xs) = n : sieve (mark xs 1 n)
  where
    mark :: [Integer] -> Integer -> Integer -> [Integer]
    mark (y:ys) k m | k == m = 0 : (mark ys 1 m)
                    | otherwise = y : (mark ys (k+1) m)

For each prime p, this algorithm checks all numbers from p+1 to the limit whether they are multiples of p. It doesn't do so by dividing, as Turner's sieve does, but by comparing a counter to the prime. Now, comparing two numbers is much faster than dividing, but the price paid for that is that each number n is now inspected for each prime < n, and not only for the primes up to n's smallest prime factor.

The result is that the complexity of this algorithm is O(N^2 / log N) versus O( (N/log N)^2 ) for the Turner sieve (and O(N*log (log N)) for a real sieve of Eratosthenes).

The ~~nesting~~¹ stacking of thunks mentioned by dflemstr exacerbates the problem², but even without that the algorithm would be worse than Turner's. I'm appalled and fascinated at the same time.

¹ "Nesting" may not be the correct word. Although each of the mark thunks is reachable only via the one above it, they don't reference anything from the enclosing thunk's scope.

² There is nothing quadratic in either size or depth of the thunks, though, and the thunks are fairly well-behaved. For the illustration, let's pretend mark were defined with reverse argument order. Then, when 7 is found to be prime, the situation is

sieve (mark 5 2 (mark 3 1 (mark 2 1 [7 .. ])))
~> sieve (mark 5 2 (mark 3 1 (7 : mark 2 2 [8 .. ])))
~> sieve (mark 5 2 (7 : mark 3 2 (mark 2 2 [8 .. ])))
~> sieve (7 : mark 5 3 (mark 3 2 (mark 2 2 [8 .. ])))
~> 7 : sieve (mark 7 1 (mark 5 3 (mark 3 2 (mark 2 2 [8 .. ]))))

and the next pattern-match by sieve forces the mark 7 1 thunk, which forces the mark 5 3 thunk, which forces the mark 3 2 thunk, which forces the mark 2 2 thunk, which forces the [8 .. ] thunk and replaces the head with 0, and wraps the tail in a mark 2 1 thunk. That bubbles up to the sieve, which discards the 0 and then forces the next stack of thunks.

So for every number from p_k + 1 to p_(k+1) (inclusive), the pattern-match in sieve forces a stack/chain of k thunks of the form mark p r. Each of these takes the (y:ys) received from the enclosed thunk ([y ..] for the innermost mark 2 r) and wraps the tail ys in a new thunk, either leaving y unchanged or replacing it with 0, thus building up a new stack/chain of thunks in what will be the tail of the list reaching sieve.

For each found prime, sieve adds another mark p r thunk on top, so at the end, when the first prime larger than 2000000 is found and takeWhile (< 2000000) finishes, there will be 148933 levels of thunks.

The stacking of thunks here doesn't influence the complexity, it just influences the constant factors. In the situation we're dealing with, a lazily generated infinite immutable list, there's not much one can do to reduce the time spent transferring control from one thunk to the next. If we were dealing with a finite mutable list or array that is not lazily generated, as one would in a language like C or Java, it would be much better to let each mark p finish its complete job (that would be a simple for loop with less overhead than function calling/transfer of control) before examining the next number, so there'd never be more than one marker active and less control-passing.

You are building up a quadratic number of unevaluated thunks using this algorithm. The algorithm relies on laziness so heavily, that this also is the reason why it doesn't scale.

Let's walk through how it works, which hopefully should make the problem apparent. For simplicitly, let's say that we want to print the elements of primes ad infinitum, i.e. we want to evaluate each cell in the list one after the other. primes is defined as:

primes :: [Integer]
primes = sieve [2..]

Since 2 isn't 0, the second definition of sieve applies, and 2 is added to the list of primes, and the rest of the list is an unevaluated thunk (I use tail instead of the pattern match n : xs in sieve for xs, so tail isn't actually being called, and doesn't add any overhead in the code below; mark is actually the only thunked function):

primes = 2 : sieve (mark (tail [2..]) 1 2)

Now we want the second primes element. So, we walk through the code (exercise for the reader) and end up with:

primes = 2 : 3 : sieve (mark (tail (mark (tail [2..]) 1 2)) 1 3)

Same procedure again, we want to evaluate the next prime...

primes = 2 : 3 : 5 : sieve (mark (tail (tail (mark (tail (mark (tail [2..]) 1 2)) 1 3))) 1 5)

This is starting to look like LISP, but I digress... Starting to see the problem? For each element in the primes list, an increasingly large thunk of stacks of mark applications have to be evaluated. In other words, for each element in the list, there has to be a check for whether that element is marked by any of the preceding primes, by evaluating each mark application in the stack. So, for n~=2000000, the Haskell runtime has to call functions resulting in a call stack with a depth of about ... I don't know, 137900 (let n = 2e6 in n / log n gives a lower bound)? Something like that. This is probably what causes the slow-down; maybe vacuum can tell you more (I don't have a computer with both Haskell and a GUI right now).

The reason why the sieve of Eratosthenes works in languages like C is that:

You aren't using an infinite list.
Because of (1), you can mark the whole array before continuing with the next n, resulting in no call stack overheads at all.

Ok, you are definitely right, it is slower than the naive implementation. I took this one from Wikipedia and compared it to your code with GHCI thus:

-- from Wikipedia
sieveW [] = [] 
sieveW (x:xs) = x : sieveW remaining 
  where 
    remaining = [y | y <- xs, y `mod` x /= 0]

-- your code
sieve :: [Integer] -> [Integer]
sieve (0 : xs) = sieve xs
sieve (n : xs) = n : sieve (mark xs 1 n)
  where
    mark :: [Integer] -> Integer -> Integer -> [Integer]
    mark (y:ys) k m | k == m = 0 : (mark ys 1 m)
                    | otherwise = y : (mark ys (k+1) m)

Running gives

[1 of 1] Compiling Main             ( prime.hs, interpreted )
Ok, modules loaded: Main.
*Main> :set +s
*Main> sum $ take 2000 (sieveW [2..])
16274627
(1.54 secs, 351594604 bytes)
*Main> sum $ take 2000 (sieve [2..])
16274627
(12.33 secs, 2903337856 bytes)

To try and understand what was happening and how exactly the mark code works, I tried expanding the code manually:

  sieve [2..]
= sieve 2 : [3..]
= 2 : sieve (mark [3..] 1 2)
= 2 : sieve (3 : (mark [4..] 2 2))
= 2 : 3 : sieve (mark (mark [4..] 2 2) 1 3)
= 2 : 3 : sieve (mark (0 : (mark [5..] 1 2)) 1 3)
= 2 : 3 : sieve (0 : (mark (mark [5..] 1 2) 1 3))
= 2 : 3 : sieve (mark (mark [5..] 1 2) 1 3)
= 2 : 3 : sieve (mark (5 : (mark [6..] 2 2)) 1 3)
= 2 : 3 : sieve (5 : mark (mark [6..] 2 2) 2 3)
= 2 : 3 : 5 : sieve (mark (mark (mark [6..] 2 2) 2 3) 1 5)
= 2 : 3 : 5 : sieve (mark (mark (0 : (mark [7..] 1 2)) 2 3) 1 5)
= 2 : 3 : 5 : sieve (mark (0 : (mark (mark [7..] 1 2) 3 3)) 1 5)
= 2 : 3 : 5 : sieve (0 : (mark (mark (mark [7..] 1 2) 3 3)) 2 5))
= 2 : 3 : 5 : sieve (mark (mark (mark [7..] 1 2) 3 3)) 2 5)
= 2 : 3 : 5 : sieve (mark (mark (7 : (mark [8..] 2 2)) 3 3)) 2 5)

I think I may have made a slight mistake toward the end there, as the 7 looks like it is about to be turned into a 0 and deleted, but the mechanism is clear. This code is merely creating a set of counters counting up to each prime, emitting the next prime at the correct moment and passing it up the list. This is equivalent to merely checking for division by each previous prime as in the naive implementation, with the additional overhead of passing the 0s or primes between the thunks.

There may be some further subtlety here that I am missing. There is a very detailed treatment of The Sieve of Eratosthenes in Haskell along with various optimisations here.

Why is this prime test so slow?

Tags:

Primes

Haskell

Sieve Of Eratosthenes

Related

Recent Posts