Mathematica Parallelization on HPC
What you need to launch subkernels across several nodes on an HPC cluster is the following:
- Figure out how to request several compute nodes for the same job
- Find the names of the nodes that have been allocated for your job
- Find out how to launch subkernels on these nodes from within the main kernel
All of this depends on the grid engine your cluster is using, as well as your local setup, so you'll need to check its documentation and ask your administrator about the details. I have an example for our local setup (complete with a jobfile), which might be helpful for you to study:
https://bitbucket.org/szhorvat/crc/src
Our cluster uses the Sun Grid Engine. The names of the nodes (and information about them) are listed in a "hostfile", which you can find by retrieving the value of the PE_HOSTFILE environment variable. (I think this works the same way with PBS, except the environment variable is called something else.)
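On SGE, the hostfile typically has one line per node, with the node name in the first column and the slot count in the second — but check your own cluster's documentation, since the exact format varies. A minimal sketch of extracting the node names, using a fabricated hostfile in place of the real $PE_HOSTFILE:

```shell
# Sketch: extract node names from an SGE-style PE hostfile.
# The four-column layout (host slots queue processor) is the usual SGE
# format, but verify it on your cluster. A fake hostfile makes this
# self-contained; in a real job, read "$PE_HOSTFILE" instead.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node01 16 all.q@node01 UNDEFINED
node02 16 all.q@node02 UNDEFINED
EOF

# The first column of each line is the node name.
nodes=$(awk '{print $1}' "$hostfile")
echo "$nodes"

rm -f "$hostfile"
```

With the sample file above this prints the two node names, one per line.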
Note that if you request multiple nodes in a single job file, the job script will run on only one of the nodes, and you'll have to launch the processes across all nodes manually (at least on SGE and PBS).
Launching processes on different nodes is usually possible with ssh: just run

ssh nodename command

to run command on nodename. You may also need to set up passphraseless authentication if it is not set up by default. To launch subkernels, you'll need to pass the -f option to ssh so that it returns immediately after it has launched the remote process.
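From within the main kernel, the ssh launch can be automated with the SubKernels`RemoteKernels` package that ships with Mathematica. This is only a sketch under assumptions: the node names are placeholders for the ones in your hostfile, and the command template assumes math is on the remote PATH (the default $RemoteCommand on your installation may differ slightly):

```mathematica
(* Sketch: launch subkernels over ssh from the main kernel, assuming
   the bundled SubKernels`RemoteKernels` package. *)
Needs["SubKernels`RemoteKernels`"]

(* The shell command used to start each remote kernel; note the -f
   flag so ssh returns as soon as the kernel is launched. `1` is
   filled with the host name, `2` with the link name, `4` with the
   link protocol options. *)
$RemoteCommand =
  "ssh -x -f `1` \"math -mathlink -linkmode Connect `4` -linkname '`2`' -subkernel >& /dev/null &\"";

(* Launch 4 subkernels on each of two (placeholder) nodes. *)
LaunchKernels[RemoteMachine["node01", 4]]
LaunchKernels[RemoteMachine["node02", 4]]
```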
Some setups use rsh instead of ssh. To launch a command in the background using rsh, you'll need to do

rsh -n nodename "command >& /dev/null &"

When running the remote process in the background, it is important to redirect the output (both stdout and stderr) because of a bug in rsh (also described in its man page) that won't let it return immediately otherwise.

Another thing to keep in mind about rsh is that you can't rsh to the local machine, so the subkernels that will run on the same machine as the main kernel must be launched without rsh.
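The local-versus-remote distinction can be handled with a small helper. A sketch, where "math" stands in for the actual subkernel command line and node02 is a placeholder name; it only prints the command it would use for each node:

```shell
# Sketch: choose a launch method per node, using rsh only for remote
# hosts, since rsh to the local machine does not work. "math" is a
# placeholder for the real subkernel command line.
local_host=$(uname -n)

launch_cmd() {
    node=$1
    kernel="math"   # placeholder for the actual subkernel command
    if [ "$node" = "$local_host" ]; then
        # Local node: no rsh, just background the kernel directly.
        echo "$kernel > /dev/null 2>&1 &"
    else
        # Remote node: rsh -n, with output redirected because of the
        # rsh bug described above.
        echo "rsh -n $node \"$kernel >& /dev/null &\""
    fi
}

# Print (rather than run) the command chosen for each node:
launch_cmd "$local_host"
launch_cmd node02
```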
See my example for details.
Update
The node names in a job can be accessed through environment variables such as PBS_NODEFILE and HOSTNAME, so launching subkernels on the correct nodes can be automated.
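A PBS node file typically lists one line per allocated core, with each node name repeated, so for automation you usually want the distinct node names. A minimal sketch, with a fabricated node file standing in for the real $PBS_NODEFILE:

```shell
# Sketch: get the unique node names from a PBS-style node file.
# PBS usually repeats each hostname once per allocated core, so we
# deduplicate. A fake node file makes this self-contained; in a job,
# read "$PBS_NODEFILE" instead.
nodefile=$(mktemp)
for i in 1 2 3 4; do echo mike054; done >  "$nodefile"
for i in 1 2 3 4; do echo mike067; done >> "$nodefile"

# Sort and deduplicate to get one line per node.
uniq_nodes=$(sort -u "$nodefile")
echo "$uniq_nodes"

rm -f "$nodefile"
```

With the sample file above this prints mike054 and mike067, one per line.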
I'm also trying to run subkernels from a main kernel on an HPC cluster. I usually submit an interactive job on the HPC, run Mathematica kernels in it, and then connect back to the front end on my laptop. My wait in the queue for an interactive job is very short, so it is convenient for me to work this way. Here is how I did it; your setup may not be the same, but I hope it helps.
Submit an interactive job
qsub -V -I -l walltime=01:00:00,nodes=2:ppn=16 -A hpc_atistartup
It will return something like this:
qsub: waiting for job 48488.mike3 to start
qsub: job 48488.mike3 ready
--------------------------------------
Running PBS prologue script
--------------------------------------
PBS has allocated the following nodes:
mike054
mike067
A total of 32 processors on 2 nodes allocated
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
Checking node mike054 15:43:46
Checking node mike067 15:43:48
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - 01-Sep-2013 15:43:48
------------------------------------------------------
[aaa@mike054 ~]$
We can see that I got nodes mike054 and mike067, and that the shell is on node mike054.
Start remote master kernel
From the menu of the local front end (my laptop), go to Evaluation ==> Kernel Configuration Options and add a remote kernel; here I added one called superMike. Select "Advanced Options" and fill it with "-LinkMode Listen -LinkProtocol TCPIP".
Then run a command in a notebook, for example $Version. A window like this will pop up (the port and IP address will be different for you):
With this pop-up window open, go to the shell we just got on the HPC and run the command math to launch command-line Mathematica. After getting the Mathematica prompt, enter

$ParentLink = LinkConnect["port1@ipaddress,port2@ipaddress", LinkProtocol -> "TCPIP"]

(using the two port@address pairs shown in the pop-up window) and hit Enter. Then click the "OK" button of that pop-up window. If it connected successfully, a message window will pop up with
Out[1]= LinkObject[port1@ipaddress,port2@ipaddress, 59, 2]
and the $Version command should return the result:
For details of the remote kernel connection, see the post here.
Start subKernels
Open the Remote Kernels tab in Evaluation ==> Parallel Kernel Configuration and click "Add Host" to add the other nodes we got in the interactive job. In this case I got nodes mike054 and mike067, and my shell is on node mike054, so I will add mike067 by filling in the hostname, setting the number of kernels, and checking "Enable".
After that we can go to Evaluation ==> Parallel Kernel Status and check whether the subkernels are working. If everything went successfully, we will see something like this:
We can see that we've launched 16 subKernels on node mike054 and 16 subKernels on node mike067.
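You can also verify from the notebook itself that the kernels really landed on both nodes. ParallelEvaluate and Tally are standard functions; only the node names in the comment are from my particular job:

```mathematica
(* Ask every subkernel for its machine name and count the answers. *)
Tally[ParallelEvaluate[$MachineName]]
(* With 16 kernels per node this should give something like
   {{"mike054", 16}, {"mike067", 16}} *)
```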
Hope it will help.