About warp voting function
The prototype of __ballot
is the following
unsigned int __ballot(int predicate);
If predicate
is nonzero, __ballot
returns a value with the N
th bit set, where N
is the thread index.
Combined with atomicOr
and __popc
, it can be used to accumulate the number of threads in each warp having a true predicate.
Indeed, the prototype of atomicOr
is
int atomicOr(int* address, int val);
and atomicOr
reads the value pointed to by address
, performs a bitwise OR
operation with val
, and writes the value back to address
and returns its old value as a return parameter.
On the other side, __popc
returns the number of bits set withing a 32
-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
const u32 warp_sum = threadIdx.x >> 5;
atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));
atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann
__ballot
is used in CUDA-histogram and in CUDA NPP library for quick generation of bitmasks, and combining it with __popc
intrinsic to make a very efficient implementation of boolean reduction.
__all
and __any
was used in reduction before introduction of __ballot
, though I can not think of any other use of them.