Parallel prefix sum - fastest implementation
Using Aparapi (https://code.google.com/p/aparapi/) in Java/OpenCL, I implemented only the sum of all elements in an array (the up-sweep/reduce phase of Blelloch), not the full prefix sum. It's available at https://github.com/klonikar/trial-aparapi/blob/master/src/trial/aparapi/Reducer.java and is written for a general block size (called localBatchSize in the code) instead of 2. I found that a block size of 8 works best for my GPU.
While the implementation works (the sum computation is correct), it performs much worse than a sequential sum. On my Core i7 (8-core) CPU, the sequential sum takes about 12 ms for 8388608 (8M) numbers, while the parallelized execution on the GPU (NVIDIA Quadro K2000M with 384 cores) takes about 100 ms. I even optimized it to transfer only the final sum after the computation, not the entire array; without this optimization it takes 20 ms more. The implementation follows the algorithm described in the answer by @marcel-valdez-orozco.
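For reference, here is a minimal sketch of the reduce idea in Aparapi, with a fixed block size of 2 and one kernel launch per tree level. This is illustrative, not the linked Reducer.java (which generalizes the stride to localBatchSize); it assumes n is a power of two, ReduceSketch is just a placeholder name, and the package name assumes the older com.amd.aparapi release (newer releases use com.aparapi).

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class ReduceSketch {
    public static int sum(int[] input) {
        final int[] data = input.clone();      // length assumed to be a power of two
        final int[] stride = new int[] { 1 };  // 2^d, doubled after every pass

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int k = getGlobalId() * 2 * stride[0];          // k steps by 2^(d+1)
                data[k + 2 * stride[0] - 1] += data[k + stride[0] - 1];
            }
        };

        // One launch per tree level: n/2 work-items, then n/4, ... down to 1.
        for (int workItems = data.length / 2; workItems >= 1; workItems /= 2) {
            kernel.execute(Range.create(workItems));
            stride[0] *= 2;
        }
        kernel.dispose();
        return data[data.length - 1];          // the total ends up in the last slot
    }
}

Note that by default Aparapi re-syncs the captured arrays on every execute(), which is a big part of the overhead discussed above; explicit buffer management (setExplicit(true) with put()/get()) is the kind of optimization the transfer-only-the-final-sum trick relies on.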
The answer to this question is in the article "Parallel Prefix Sum (Scan) with CUDA" (GPU Gems 3, chapter 39) and in the paper "Prefix Sums and Their Applications" (Blelloch, CMU-CS-90-190). The NVIDIA article provides the best possible implementation for CUDA GPUs, and the Carnegie Mellon University paper explains the algorithm. I also implemented an O(n/p) prefix sum using MPI, which you can find here: In my github repo.
This is the pseudocode for the generic algorithm (platform-independent):
Example 3. The Up-Sweep (Reduce) Phase of a Work-Efficient Sum Scan Algorithm (After Blelloch 1990)
for d = 0 to log2(n) - 1 do
    for all k = 0 to n - 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
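For concreteness, the up-sweep translates line for line into plain Java. This is a sequential sketch with illustrative names: the inner loop over k is the part that runs in parallel in the real algorithm, and n is assumed to be a power of two.

// Up-sweep (reduce): afterwards x[n - 1] holds the sum of all elements.
static void upSweep(int[] x) {
    int n = x.length;
    int levels = Integer.numberOfTrailingZeros(n);      // log2(n) when n is a power of two
    for (int d = 0; d < levels; d++) {
        int stride = 1 << (d + 1);                      // 2^(d+1)
        for (int k = 0; k < n; k += stride) {           // "in parallel" in the real algorithm
            x[k + stride - 1] += x[k + stride / 2 - 1]; // x[k + 2^d - 1]
        }
    }
}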
Example 4. The Down-Sweep Phase of a Work-Efficient Parallel Sum Scan Algorithm (After Blelloch 1990)
x[n - 1] = 0
for d = log2(n) - 1 down to 0 do
    for all k = 0 to n - 1 by 2^(d+1) in parallel do
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
Where x is the input data, n is the size of the input, and d is the tree level of the current pass (not the number of processors; at each level, the loop over k runs in parallel across however many processors are available). This is a shared-memory computation model; if it used distributed memory, you'd need to add communication steps to that code, as I did in the provided GitHub example.
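The down-sweep is just as mechanical in plain Java. Combined with the upSweep sketch above, it turns x into its exclusive prefix sum in place (again a sequential sketch, with n a power of two):

// Down-sweep: run after upSweep to get the exclusive prefix sum in place.
static void downSweep(int[] x) {
    int n = x.length;
    int levels = Integer.numberOfTrailingZeros(n);      // log2(n) when n is a power of two
    x[n - 1] = 0;
    for (int d = levels - 1; d >= 0; d--) {
        int stride = 1 << (d + 1);                      // 2^(d+1)
        int half = stride / 2;                          // 2^d
        for (int k = 0; k < n; k += stride) {           // "in parallel" in the real algorithm
            int t = x[k + half - 1];
            x[k + half - 1] = x[k + stride - 1];
            x[k + stride - 1] += t;
        }
    }
}

For example, running upSweep and then downSweep on {3, 1, 7, 0} yields {0, 3, 4, 11}, the exclusive scan of the input.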