Parallel Reduction
Basically, it is performing the operation shown in the picture below:
This code is basically saying that half of the threads will performance the reading from global memory and writing to shared memory, as shown in the picture.
You execute a Kernel, and now you want to reduce some values, you limit the access to the code above to only half of the total of threads running. Imagining you have 4 blocks, each one with 512 threads, you limit the code above to only be executed by the first two blocks, and you have a g_idate[4*512]
:
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
So:
thread 0 of block = 0 will copy the position 0 and 512,
thread 1 of block = 0 position 1 and 513;
thread 511 of block = 0 position 511 and 1023;
thread 0 of block 1 position 1024 and 1536
thread 511 of block = 1 position 1535 and 2047
The blockDim.x*2
is used because each thread will access to position i
and i+blockDim.x
so you need to multiple by 2
to guarantee that the threads on next id
block do not compute the position of g_idata
already computed.