CUDA on Tesla V100 card
The CUDALink package doesn't work right on Linux out of the box. I've been reporting this for years, and it doesn't get fixed. You have to jump through some hoops, but once you get it working, it's really sweet! My recommendation:
- first ensure that your CUDA system itself works, i.e. without Mathematica: compile and run the samples, write your own toy examples, etc.
- I've had success adding
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64
export NVIDIA_DRIVER_LIBRARY_PATH=/usr/bin/
export CUDA_LIBRARY_PATH=/usr/lib64/libcuda.so
at the top of my /etc/profile file. You can probably also add this to your user's .bashrc, but I haven't tried that; I prefer to have it in the profile. (You can check from within Mathematica that the variables arrived; see the sketch after this list.)
- you may need to make some modifications, since you're on CentOS and I'm on Fedora, but in principle it should be the same, as they're compatible .rpm distros; I believe the above should work unchanged on Fedora, CentOS, and RHEL.
- ensure you have the correct driver installed. One ships with the CUDA installation, but I found it doesn't always update properly through rpmfusion (not NVIDIA's fault; it's the rpmfusion folks that don't get this right). So download and install the latest version directly from NVIDIA. I prefer to use the .run file; if they tell you differently on #rpmfusion, ignore them and use the .run file.
- now the CUDALink package should be available. But you still have to keep two things in mind:
- you need to make a one-time dummy evaluation before CUDAFunctionLoad works: calling it twice, or evaluating something else first (looking at the installed compilers, or the compiler information), does the trick. You only need to do this once after loading the package, but you must repeat it every time you reload the package! (See the sketch after this list.)
- for CUDAFunctionLoad you need to specify "CompilerInstallation" -> "/usr/local/cuda-10.1/bin/" and "XCompilerInstallation" -> "/usr/bin/" in the options.
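To make those last two items concrete, here is a minimal sketch of how I start a session. Treat the specifics as assumptions on my part: Environment[] is just one way to confirm that the profile exports reached the kernel, and CCompilers[] from CCompilerDriver` is one evaluation that serves as the dummy; any similar query should do.

(* check that the kernel actually sees the variables from /etc/profile *)
Environment /@ {"NVIDIA_DRIVER_LIBRARY_PATH", "CUDA_LIBRARY_PATH"}

Needs["CUDALink`"]

(* the one-time dummy evaluation after loading the package: looking at
   the installed compilers does the trick; repeat after every reload *)
Needs["CCompilerDriver`"]
CCompilers[]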
Now it works like a charm. The possibilities with CUDAFunctionLoad / CUDAFunction are just mind-boggling; those two are 99% or more of what I use from the package. I'm thoroughly impressed: it's even easier than putting your CUDA code in a file and compiling it outside of Mathematica, which needs tons of other files and dependencies, the proper compilation line, the linker, and so on. CUDAFunctionLoad takes care of all that. I find it the simplest way to work with CUDA, and it integrates directly with your workflow.
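To see what I mean, here is a minimal complete kernel, a sketch along the lines of the addTwo example in the CUDALink documentation, with the two installation options from above:

Needs["CUDALink`"]
code = "
  __global__ void addTwo(mint *arr, mint n) {
      int i = threadIdx.x + blockIdx.x*blockDim.x;
      if (i < n) arr[i] += 2;   /* add 2 to every element */
  }";
addTwo = CUDAFunctionLoad[code, "addTwo",
   {{_Integer, _, "InputOutput"}, _Integer}, 256,
   "CompilerInstallation" -> "/usr/local/cuda-10.1/bin/",
   "XCompilerInstallation" -> "/usr/bin/"];
addTwo[Range[10], 10]
(* {{3, 4, 5, 6, 7, 8, 9, 10, 11, 12}} *)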
If you do it the way I described above, you can also ignore the old paclet version that Mathematica installs: if you set these paths in the profile correctly, you can use the latest and greatest version that you have installed.
Give it a try, let me know where you get stuck.
In the end, in our case, the following allowed us to get CUDA to work:
export NVIDIA_DRIVER_LIBRARY_PATH=/usr/lib64/libnvidia-tls.so.418.116.00
(adapt the path to your driver version).
Then, in Mathematica,
Needs["CUDALink`"]
and
CUDAQ[]
launches the download of about 4.3 GB (!) of data into .Mathematica (most of it in Paclets/Repository/CUDAResources-Lin64-12.0.346/CUDAToolkit), so be patient, but eventually it returns
(* True *)
Note, incidentally, that the machine obviously needs internet access for this step.
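Once CUDAQ[] returns True, you can ask CUDALink what it sees; a quick sanity check (the output is machine-specific; here it should list the Tesla V100):

CUDAInformation[]    (* per-device properties: name, core count, memory, ... *)
CUDADriverVersion[]  (* the NVIDIA driver version CUDALink picked up *)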
Then the sky is the limit :0)
width = 1024; height = 768;
iconfig = {width, height, 1, 0, 1, 6};
config = {0.001, 0.0, 0.0, 0.0, 8.0, 15.0, 10.0, 5.0};

(* camera: position, target, then viewing direction and two basis vectors *)
camera = {{2.0, 2.0, 2.0}, {0.0, 0.0, 0.0}};
AppendTo[camera, Normalize[camera[[2]] - camera[[1]]]];
AppendTo[camera, 0.75*Normalize[Cross[camera[[3]], {0.0, 1.0, 0.0}]]];
AppendTo[camera, 0.75*Normalize[Cross[camera[[4]], camera[[3]]]]];
config = Join[{config, Flatten[camera]}];

(* output buffer on the GPU *)
pixelsMem = CUDAMemoryAllocate["Float", {height, width, 3}];

(* compile the mandelbulb kernel that ships with CUDALink *)
srcf = FileNameJoin[{$CUDALinkPath, "SupportFiles", "mandelbulb.cu"}];
mandelbulb = CUDAFunctionLoad[{srcf}, "MandelbulbGPU",
   {{"Float", _, "Output"}, {"Float", _, "Input"},
    {"Integer32", _, "Input"}, "Integer32", "Float", "Float"},
   {16}, "UnmangleCode" -> False];

(* run the kernel and pull the pixels back to the host *)
mandelbulb[pixelsMem, Flatten[config], iconfig, 0, 0.0, 0.0, {width*height*3}];
pixels = CUDAMemoryGet[pixelsMem];
Image[pixels]
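As with any buffer from CUDAMemoryAllocate, unload it once you have fetched the pixels (more on GPU memory at the very end):

CUDAMemoryUnload[pixelsMem]  (* frees the buffer on the GPU *)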
And, following @Fortsaint's comment:
data = Flatten@Table[{x, y} -> Exp[-Norm[{x, y}]], {x,-3,3,.005}, {y,-3,3,.005}];
net = NetChain[{32, Tanh, 1}];
trained = NetTrain[net, data, BatchSize -> 1024, "TargetDevice" -> "GPU"]
Starting training.
Optimization Method: ADAM
Device: GPU
Batch Size: 1024
Batches Per Round: 1409
Training Examples: 1442401
....
Training complete.
which is 16 times faster on the GPU than on the CPU.
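That factor comes from simply timing the same training on both targets; a sketch (your numbers will vary, and each NetTrain call retrains from scratch):

gpu = First@AbsoluteTiming[
    NetTrain[net, data, BatchSize -> 1024, "TargetDevice" -> "GPU"]];
cpu = First@AbsoluteTiming[
    NetTrain[net, data, BatchSize -> 1024, "TargetDevice" -> "CPU"]];
cpu/gpu  (* ~16 in our case *)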
This is a reply to a question from chris, not a "new" answer.
You mean a working example? How about 250 million normally distributed random numbers, displayed in a histogram:
Needs["CUDALink`"]
srcf = FileNameJoin[{$CUDALinkPath, "SupportFiles", "random.cu"}]
CUDAInverseCND = CUDAFunctionLoad[{srcf}, "InverseCND",
   {{_Real, _, "InputOutput"}, _Integer, _Integer}, 256,
   "CompilerInstallation" -> "/usr/local/cuda-10.1/bin/",
   "XCompilerInstallation" -> "/usr/bin/"]
(* evaluate the cell above twice; this is the dummy evaluation mentioned earlier *)
sampleCount = 250000000;
mem = CUDAMemoryAllocate[Real, sampleCount];
CUDAInverseCND[mem, sampleCount, 0];
samples = CUDAMemoryGet[mem];
Histogram[samples, 1000, "ProbabilityDensity"]
this should take about two seconds for the CUDA part; the histogram will take longer, since it now has to deal with 250 million data points.
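If you want to verify that, time the device part and the host copy separately (a sketch; the numbers depend on your card and bus):

AbsoluteTiming[CUDAInverseCND[mem, sampleCount, 0];]  (* the CUDA part *)
AbsoluteTiming[samples = CUDAMemoryGet[mem];]         (* device-to-host copy *)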
you can now analyse the memory with
CUDAMemoryInformation@mem
the important parts are:
HostStatus -> Synchronized, DeviceStatus -> Synchronized, Residence -> DeviceHost, Sharing -> Shared, Type -> Double, ByteCount -> 2000000000, Dimensions -> {250000000}
when HostStatus and DeviceStatus are both Synchronized, all is good. Before the CUDAMemoryGet, CUDAMemoryInformation@mem will say HostStatus -> Unsynchronized: no memcopy to the host has happened yet.
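You can watch that transition on a small throwaway buffer; a sketch:

mem2 = CUDAMemoryAllocate[Real, 10];
CUDAInverseCND[mem2, 10, 0];   (* runs on the device only *)
CUDAMemoryInformation[mem2]    (* HostStatus -> Unsynchronized *)
CUDAMemoryGet[mem2];           (* memcopy to the host *)
CUDAMemoryInformation[mem2]    (* now HostStatus -> Synchronized *)
CUDAMemoryUnload[mem2]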
when done you should
CUDAMemoryUnload@mem
to free the memory on your GPU (this example occupies about 2 GB of GPU memory).
HTH