Does __syncthreads() synchronize all threads in the grid?

I agree with all the answers here, but I think we are missing one important point with respect to the first question. I am not addressing the second question, as it is answered perfectly in the answers above.

Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at any given instant each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that is (128/32 =) 4 warps for the GPU.

Now the question becomes: "If all threads are executing the same instruction, then why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads() does not synchronize the threads within a warp; they are already synchronized. It synchronizes the warps that belong to the same block.

That is why the answer to your question is: __syncthreads() does not synchronize all threads in the grid, only the threads belonging to one block, since each block executes independently.

If you want to synchronize the whole grid, divide your kernel (K) into two kernels (K1 and K2) and launch both. They will be synchronized (K2 will execute after K1 finishes).


The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads of the block evaluate such code identically; otherwise the execution is likely to hang or produce unintended side effects [4].
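A minimal sketch of this rule (not from the original answer; kernel and parameter names are illustrative): the barrier may sit inside a conditional only if every thread of the block takes the same branch, for example when the condition depends on a kernel argument rather than on threadIdx.

__global__ void conditionalSyncExample(const int *in, int *out, int n, int useTile)
{
    __shared__ int tile[256];                 // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // SAFE: 'useTile' is a kernel argument, so every thread in the block
    // evaluates the condition identically and all of them reach the barrier.
    if (useTile)
    {
        tile[threadIdx.x] = (idx < n) ? in[idx] : 0;
        __syncthreads();
        if (idx < n)
            out[idx] = tile[(threadIdx.x + 1) % blockDim.x];
    }

    // UNSAFE (shown only as a comment): the branch depends on threadIdx, so
    // only part of the block would reach the barrier and the kernel may hang.
    //
    //   if (threadIdx.x < 128)
    //       __syncthreads();
}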

Example of using __syncthreads(): (source)

#define THREADS_PER_BLOCK 128   // assumed value; must match the block size used at launch

__global__ void globFunction(int *arr, int N) 
{
    __shared__ int local_array[THREADS_PER_BLOCK];  // per-block shared-memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // ...calculate results (here simply the thread's global value as a placeholder)
    int results = (idx < N) ? arr[idx] : 0;
    local_array[threadIdx.x] = results;

    // synchronize the threads of this block writing to the shared-memory cache
    __syncthreads();

    // read the result written by a neighbouring thread of the same block
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];

    // write the value back to global memory
    if (idx < N)
        arr[idx] = val;        
}

To synchronize all threads in a grid there is currently no native API call. One way of synchronizing threads at grid level is to use consecutive kernel calls, since at that point all threads end and then start again from the same point. This is also commonly called CPU synchronization or implicit synchronization. Thus they are all synchronized.

The original answer linked an external example of this CPU synchronization technique; below is a minimal sketch of the idea (the kernel names and the helper function are illustrative, not from that source):
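#include <cuda_runtime.h>

// Two hypothetical kernels launched back to back on the same (default) stream.
// kernel2 only starts once every thread of kernel1 has finished, which acts
// as a grid-wide barrier between the two phases.
__global__ void kernel1(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2;                        // first phase
}

__global__ void kernel2(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1;                        // second phase, sees all results of kernel1
}

void runBothPhases(int *d_data, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    kernel1<<<blocks, threads>>>(d_data, n);   // grid 1
    kernel2<<<blocks, threads>>>(d_data, n);   // implicitly waits for grid 1 on this stream
    cudaDeviceSynchronize();                   // optional: make the host wait as well
}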

Regarding the second question: yes, it does declare the amount of shared memory specified per block. Take into account that the quantity of available shared memory is measured per SM, so one should be very careful about how the shared memory is used together with the launch configuration.
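A minimal sketch of what "per block" means in the launch configuration (names and sizes are illustrative, not from the original answer): the third launch parameter requests dynamic shared memory, and every block gets its own buffer of that size.

#include <cuda_runtime.h>

__global__ void scaleWithSharedBuffer(float *data, int n)
{
    extern __shared__ float buffer[];         // sized at launch time, one copy per block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    buffer[threadIdx.x] = (idx < n) ? data[idx] : 0.0f;
    __syncthreads();

    if (idx < n)
        data[idx] = 0.5f * (buffer[threadIdx.x] +
                            buffer[(threadIdx.x + 1) % blockDim.x]);
}

void launch(float *d_data, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    size_t sharedBytes = threads * sizeof(float);   // shared memory requested per block

    // Each block receives its own 'sharedBytes' bytes; the shared memory
    // available per SM limits how many blocks can be resident at once.
    scaleWithSharedBuffer<<<blocks, threads, sharedBytes>>>(d_data, n);
}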


__syncthreads() waits until all threads within the same block have reached the command; that means all warps that belong to a thread block must reach the statement.

If you declare shared memory in a kernel, the array will only be visible to one thread block. So each block will have its own copy of that shared memory.
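To make the per-block visibility concrete, here is a small sketch (kernel name and block size are illustrative): every block reduces its own slice of the input inside its private copy of the shared array, and the only way to combine results across blocks is through global memory.

__global__ void perBlockSum(const int *in, int *blockSums, int n)
{
    __shared__ int partial[256];              // assumes blockDim.x == 256; one copy per block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (idx < n) ? in[idx] : 0;
    __syncthreads();

    // Tree reduction within this block's private shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                      // loop count is uniform, so this is safe
    }

    // One value per block; combining them needs global memory (or another kernel).
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = partial[0];
}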
