Templated CUDA kernel with dynamic shared memory
(A variation on @RobertCrovella's answer)
NVCC is not willing to accept two extern __shared__
arrays of the same name but different types - even if they're never in each other's scope. We'll need to satisfy NVCC by having our template instances all use the same type for the shared memory under the hood, while letting the kernel code using them see the type it likes.
So we replace this instruction:
extern __shared__ T smem[];
with this one:
auto smem = shared_memory_proxy<T>();
where:
template <typename T>
__device__ T* shared_memory_proxy()
{
// do we need an __align__() here? I don't think so...
extern __shared__ unsigned char memory[];
return reinterpret_cast<T*>(memory);
}
is in some device-side code include file.
Advantages:
- One-liner at the site of use.
- Simpler syntax to remember.
- Separation of concerns - whoever reads the kernel doesn't have to think about why s/he's seeing
extern
, or alignment specifiers, or a reinterpret cast etc.
edit: This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh
(where it's named shared_memory::dynamic::proxy()
).
Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:
replace this:
extern __shared__ T smem[];
with this:
extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);
You can see other examples of re-casting of dynamically allocated shared memory pointers in the programming guide which can serve other needs.
EDIT: updated my answer to reflect the comment by @njuffa.