CUDA runtime error (59) : device-side assert triggered
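In general, when you encounter CUDA runtime errors, it is advisable to run your program again with the CUDA_LAUNCH_BLOCKING=1 environment variable set to obtain an accurate stack trace. CUDA kernels are launched asynchronously, so without it the error is often reported at a later, unrelated call.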
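A minimal sketch of setting the variable from inside a script, assuming it runs before anything initializes CUDA (in practice, before torch is imported). Setting it in the shell, for example `CUDA_LAUNCH_BLOCKING=1 python train.py`, has the same effect; train.py is a placeholder name.

```python
import os

# Must be set before CUDA is initialized, so do it before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the failing code only after the line above
```

With this set, kernel launches become synchronous, so the Python stack trace should point at the call that actually triggered the assert.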
In your specific case, the targets in your data were too high (or too low) for the specified number of classes: every target must lie in the range 0 to num_classes - 1.
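One way to confirm this before the loss is computed is to scan the targets on the host. The sketch below is plain Python and only an illustration: find_bad_targets is a hypothetical helper, and the label values are made up.

```python
def find_bad_targets(targets, num_classes):
    """Return the targets that fall outside the valid range [0, num_classes - 1]."""
    return [t for t in targets if not 0 <= t < num_classes]

# For a torch tensor of labels, pass `labels.tolist()` instead of a list.
targets = [1, 2, 3, 4, 5]  # illustrative labels that start at 1
bad = find_bad_targets(targets, num_classes=5)
print(bad)
```

An empty result means every target is a valid class index; anything else lists the values that would trip the device-side assert.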
I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). I found the cause by moving to the CPU, where the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to a length of 512 tokens.
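A rough sketch of that truncation in plain Python, assuming the input is already a list of token ids and that the final id is an end marker worth keeping. truncate_ids is a hypothetical helper, and 101 and 102 are used here as the [CLS] and [SEP] ids of bert-base-uncased. If you tokenize with the Hugging Face tokenizer, passing truncation=True and max_length=512 should have a similar effect.

```python
MAX_LEN = 512  # number of position embeddings in bert-base-uncased

def truncate_ids(input_ids, max_len=MAX_LEN):
    """Cut input_ids to at most max_len entries, keeping the final id."""
    ids = list(input_ids)
    if len(ids) <= max_len:
        return ids
    # Keep the first max_len - 1 ids and re-append the last one (e.g. [SEP]).
    return ids[: max_len - 1] + ids[-1:]

long_input = [101] + [2000] * 600 + [102]  # illustrative over-long sequence
short = truncate_ids(long_input)
print(len(short))
```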
One way to raise the "CUDA error: device-side assert triggered" RuntimeError is by indexing into a GPU torch.Tensor using a list that contains out-of-bounds indices.
So, this snippet would raise an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:
data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]
whereas this one would raise the CUDA "device-side assert triggered" RuntimeError:
data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]
which could mean that in the case of class labels, such as in the answer by @Rainy, it's the final class label (i.e. when label == num_classes) that is causing the error, when the labels start from 1 rather than 0.
Also, when the device is "cpu", the error thrown is an IndexError, such as the one thrown by the first snippet.
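Since the list-based lookup only fails inside the kernel, one option is to validate the indices on the host before indexing. A minimal sketch, where check_indices is a hypothetical helper and size would be data.shape[0] for the tensor in the snippets above:

```python
def check_indices(indices, size):
    """Raise IndexError early if any index is outside [-size, size - 1]."""
    bad = [i for i in indices if not -size <= i < size]
    if bad:
        raise IndexError(
            f"indices {bad} are out of bounds for dimension of size {size}"
        )

check_indices([1, 2], size=3)  # valid indices, passes silently

try:
    check_indices([1, 3], size=3)  # same indices as the failing snippet
except IndexError as err:
    message = str(err)
    print(message)
```

This gives a readable Python error on any device, at the cost of one extra pass over the index list.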
This is usually an indexing issue.
For example, if your ground truth label starts at 1:
target = [1,2,3,4,5]
Then you should subtract 1 from every label so that:
target = [0,1,2,3,4]
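The shift can be sketched as follows, assuming the labels are contiguous integers so that moving the smallest one to 0 is enough. shift_to_zero_based is a hypothetical helper; for a torch tensor of labels, subtracting 1 from the tensor directly does the same job.

```python
def shift_to_zero_based(target):
    """Shift labels so that the smallest label becomes 0."""
    offset = min(target)
    return [t - offset for t in target]

target = [1, 2, 3, 4, 5]
shifted = shift_to_zero_based(target)
print(shifted)
```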