What is the difference between __ldg() intrinsic and a normal execution?
From the CUDA C Programming Guide
Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.
...
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the
__ldg()
function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use__ldg()
to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both theconst
and__restrict__
qualifiers increases the likelihood that the compiler will detect the read-only condition.
The read only cache accesses have a much lower latency than the global memory accesses. Because matrix multiplication accesses the same values from memory many times, caching in the read only cache gives a huge speedup (in memory bound applications).