What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?
When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. an attempt to dereference an invalid pointer), but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() is executed in device code and the assert condition is false.
Such an error occurs as a result of a specific kernel. Runtime error checking in CUDA is necessarily asynchronous, so the error may be reported at a later API call than the launch that caused it, but there are at least 3 possible methods to start debugging this.
1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error-checking after each kernel launch. This will identify the specific kernel that caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use step 2 or 3 below.
2. Run your code with cuda-memcheck. This tool is something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, and also the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error instead of assert(), but the process with assert() will be similar).
3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), then the debugger will have back-trace reports that indicate which line the assert was on when it was hit.
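For a PyTorch script, where the kernels live inside the library rather than in your own code, one way to approximate method 1 without editing kernel source is the CUDA_LAUNCH_BLOCKING environment variable, which the CUDA runtime reads to make kernel launches synchronous. The following is a minimal sketch using only the Python standard library; the helper names and the script name train.py are hypothetical, chosen for illustration.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous kernel launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until the kernel
    # finishes, so an error is reported at the launch that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *args):
    """Run a Python script (hypothetical, e.g. train.py) with blocking launches."""
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=blocking_env()).returncode
```

Calling run_blocking("train.py") should then produce a Python traceback that points at the operation which launched the failing kernel, rather than at some later, unrelated call. The variable has to be set before the process initializes CUDA, which is why this sketch sets it for a child process instead of changing it midway through a running script.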
Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a Python script.
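As a rough illustration of launching a Python script under these tools, the sketch below only builds the command lines; it does not run anything by itself. It assumes a hypothetical script named train.py, and the helper names are made up for this example. It also assumes that newer CUDA toolkits provide compute-sanitizer as the successor to cuda-memcheck, so it prefers that tool when it is found on PATH.

```python
import shutil
import sys


def memcheck_command(script, *args, which=shutil.which):
    """Build a command that runs a Python script under the CUDA memory checker."""
    # Prefer compute-sanitizer (newer toolkits), fall back to cuda-memcheck.
    tool = next((t for t in ("compute-sanitizer", "cuda-memcheck") if which(t)), None)
    if tool is None:
        raise FileNotFoundError("neither compute-sanitizer nor cuda-memcheck is on PATH")
    # Both tools take the application and its arguments after their own options.
    return [tool, sys.executable, script, *args]


def debugger_command(script, *args):
    """Build a command that starts a Python script under cuda-gdb."""
    # --args is the gdb option for passing the program together with its arguments.
    return ["cuda-gdb", "--args", sys.executable, script, *args]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(memcheck_command("train.py")), or joined with spaces and typed into a shell. The which parameter exists only so the tool lookup can be replaced in tests.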
At this point you have discovered what the assert is and where in the source code it is. Why it is there cannot be answered generically. This will depend on the developer's intention, and if it is not commented or otherwise obvious, you will need some method to intuit it. The question of "how to work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert, and inspect machine state, e.g. variables, when the assert is about to be hit).