How can I run Spark on a cluster using Slurm?

To run an application that uses a Spark context, you first need a Slurm job that starts a Spark master and some workers. There are a few things to watch out for when using Slurm:

don't start Spark as a daemon
make the Spark workers use only as many cores and as much memory as requested for the Slurm job
in order to run master and worker in the same job you will have to branch somewhere in your script
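
The branching mentioned in the last point can be isolated into a small helper. This is only a minimal sketch: `spark_role` is a hypothetical function name, not part of Spark or Slurm, and it assumes that the task with `SLURM_PROCID` 0 should become the master while every other task becomes a worker.

```shell
#!/bin/bash
# Hypothetical helper (not part of Spark or Slurm): map a Slurm task
# rank to the Spark role that this task should play.
spark_role() {
    if [ "$1" -eq 0 ]; then
        echo master
    else
        echo worker
    fi
}

# Inside an srun step Slurm sets SLURM_PROCID for every task. Default to 0
# here so that the sketch can also be run outside of a job.
echo "task ${SLURM_PROCID:-0} acts as $( spark_role "${SLURM_PROCID:-0}" )"
```

In a real job script the two branches would start the master or the worker process instead of just printing the role.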

I'm working with the Linux binaries installed to $HOME/spark-1.5.2-bin-hadoop2.6/. Remember to replace <username> and <shared folder> with some valid values in the script.

#!/bin/bash
#start_spark_slurm.sh

#SBATCH --nodes=3
#  ntasks per node MUST be one, because multiple workers per node don't
#  work well with slurm + spark in this script (they would need increasing
#  ports among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
#  Beware! $HOME will not be expanded, and invalid paths will result in Slurm
#  jobs hanging indefinitely with status CG (completing) when calling scancel!
#SBATCH --output="/home/<username>/spark/logs/%j.out"
#SBATCH --error="/home/<username>/spark/logs/%j.err"
#SBATCH --time=01:00:00

# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
    this=$0
    # I experienced problems with some nodes not finding the script:
    #   slurmstepd: execve(): /var/spool/slurm/job123/slurm_script:
    #   No such file or directory
    # that's why this script is copied to a shared location that all
    # nodes can access:
    script=/<shared folder>/${SLURM_JOBID}_$( basename -- "$0" )
    cp "$this" "$script"

    # This might not be necessary on all clusters
    module load scala/2.10.4 java/jdk1.7.0_25 cuda/7.0.28

    export sparkLogs=$HOME/spark/logs
    export sparkTmp=$HOME/spark/tmp
    mkdir -p -- "$sparkLogs" "$sparkTmp"

    export SPARK_ROOT=$HOME/spark-1.5.2-bin-hadoop2.6/
    export SPARK_WORKER_DIR=$sparkLogs
    export SPARK_LOCAL_DIRS=$sparkLogs
    export SPARK_MASTER_PORT=7077
    export SPARK_MASTER_WEBUI_PORT=8080
    export SPARK_WORKER_CORES=$SLURM_CPUS_PER_TASK
    export SPARK_DAEMON_MEMORY=$(( $SLURM_MEM_PER_CPU * $SLURM_CPUS_PER_TASK / 2 ))m
    export SPARK_MEM=$SPARK_DAEMON_MEMORY

    srun "$script" 'srunning'
# If run by srun, then decide by $SLURM_PROCID whether we are master or worker
else
    source "$SPARK_ROOT/sbin/spark-config.sh"
    source "$SPARK_PREFIX/bin/load-spark-env.sh"
    if [ "$SLURM_PROCID" -eq 0 ]; then
        export SPARK_MASTER_IP=$( hostname )
        MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )

        # The saved IP address + port is needed later for submitting jobs
        echo "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" > "$sparkLogs/${SLURM_JOBID}_spark_master"

        "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.master.Master \
            --ip "$SPARK_MASTER_IP"                                         \
            --port "$SPARK_MASTER_PORT"                                     \
            --webui-port "$SPARK_MASTER_WEBUI_PORT"
    else
        # $(scontrol show hostname) is used to convert e.g. host20[39-40]
        # to host2039. This step assumes that SLURM_PROCID=0 corresponds to
        # the first node in SLURM_NODELIST!
        MASTER_NODE=spark://$( scontrol show hostname $SLURM_NODELIST | head -n 1 ):$SPARK_MASTER_PORT
        "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_NODE"
    fi
fi
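
The memory setting in the script is plain integer arithmetic: half of the memory requested per task, in megabytes, with an `m` suffix. As a rough sketch of just that calculation (the function name `spark_memory` is made up for illustration, and the values 500 and 4 are the ones from the `#SBATCH` lines above):

```shell
#!/bin/bash
# Hypothetical helper: take the memory per CPU (in MB) and the number of
# CPUs per task, and print half of the task's total memory in the
# "<number>m" notation used for SPARK_DAEMON_MEMORY in the script.
spark_memory() {
    mem_per_cpu_mb=$1
    cpus_per_task=$2
    echo "$(( mem_per_cpu_mb * cpus_per_task / 2 ))m"
}

# --mem-per-cpu=500 and --cpus-per-task=4 as requested above
spark_memory 500 4   # prints 1000m
```

Note that the division is an integer division, so odd totals are rounded down.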

Now start the sbatch job and, once the master is up, submit example.jar:

mkdir -p -- "$HOME/spark/logs"
jobid=$( sbatch ./start_spark_slurm.sh )
jobid=${jobid##Submitted batch job }
MASTER_WEB_UI=''
while [ -z "$MASTER_WEB_UI" ]; do 
    sleep 1s
    if [ -f "$HOME/spark/logs/$jobid.err" ]; then
        MASTER_WEB_UI=$( sed -n -r 's|.*Started MasterWebUI at (http://[0-9.:]*)|\1|p' "$HOME/spark/logs/$jobid.err" )
    fi
done
MASTER_ADDRESS=$( cat -- "$HOME/spark/logs/${jobid}_spark_master" ) 
"$HOME/spark-1.5.2-bin-hadoop2.6/bin/spark-submit" --master "$MASTER_ADDRESS" example.jar
firefox "$MASTER_WEB_UI"
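
One caveat: the polling loop above never terminates if the job fails before the master has started. A possible way around that, sketched here under the assumption that waiting for the `${jobid}_spark_master` file is good enough, is a small helper with a timeout. `wait_for_file` is a hypothetical name and the 120 seconds in the usage comment are an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: poll once per second until the given file exists
# and is non-empty. Gives up after the given number of seconds
# (default: 60) and returns a non-zero status in that case.
wait_for_file() {
    file=$1
    timeout=${2:-60}
    waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $file" >&2
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}

# Possible usage with the files written by start_spark_slurm.sh:
#   wait_for_file "$HOME/spark/logs/${jobid}_spark_master" 120 || exit 1
#   MASTER_ADDRESS=$( cat -- "$HOME/spark/logs/${jobid}_spark_master" )
```

The master file is written just before the master process is launched, so the master may still need a moment before it accepts connections.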

As maxmlnkn's answer states, you need a mechanism to set up and launch the appropriate Spark daemons in a Slurm allocation before a Spark jar can be executed via spark-submit.

Several scripts and systems have been developed to do this setup for you. The answer you linked above mentions Magpie at https://github.com/LLNL/magpie (full disclosure: I'm the developer/maintainer of those scripts). Magpie provides a job submission file (submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark) that you edit to add your cluster specifics and the job scripts to execute. Once configured, you'd submit it via 'sbatch -k ./magpie.sbatch-srun-spark'. See doc/README.spark for more details.

There are other scripts and systems that do this for you as well. I have no experience with them, so I can't comment beyond linking them below.

https://github.com/glennklockwood/myhadoop

https://github.com/hpcugent/hanythingondemand
