Running Spark jobs on a YARN cluster with additional files

You may want to try using local:// and the $SPARK_YARN_STAGING_DIR environment variable.

For example, the following should work:

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --files /absolute/path/to/local/test.py \
    --class somepackage.PythonLauncher \
    local://$SPARK_YARN_STAGING_DIR/test.py

To understand why, you need to be familiar with the differences among Spark's three run modes: standalone, yarn-client, and yarn-cluster.

In standalone and yarn-client modes, the driver program runs on your local machine while the worker programs run somewhere else (in standalone mode, possibly another temp directory under $SPARK_HOME; in yarn-client mode, possibly a random node in the cluster), so a local path works when it is referenced in the driver program but not in the worker programs.

However, when you run in yarn-cluster mode, both the driver and the worker programs run on arbitrary cluster nodes, and local paths are resolved relative to each node's own working machine and directory, so a file-not-found exception is thrown. You need to ship these files with --files or --archives when submitting, package them into an .egg or .jar yourself before submitting, or use the addFile API in your driver program.
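
For the addFile route, a minimal PySpark sketch might look like the following, reusing the /absolute/path/to/local/test.py path from the example above (the app name is just a placeholder):

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="add-file-example")

# Ship the local file; Spark makes a copy available on every node that runs the job.
sc.addFile("/absolute/path/to/local/test.py")

# On the driver or inside a task, resolve the shipped copy by its file name.
print(SparkFiles.get("test.py"))

sc.stop()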


I don't use Python myself, but here are some clues that may be useful for you, taken from the Spark 1.3 SparkSubmitArguments source code (a combined example follows the list):

  • --py-files PY_FILES, Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

  • --files FILES, Comma-separated list of files to be placed in the working directory of each executor.

  • --archives ARCHIVES, Comma-separated list of archives to be extracted into the working directory of each executor.
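
For instance, a Python submission using all three options might look like this sketch (deps.zip, config.json, data.tar.gz, and main.py are placeholder names):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files deps.zip \
    --files config.json \
    --archives data.tar.gz \
    main.py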

Also note that your arguments to spark-submit should follow this order:

Usage: spark-submit [options] <app jar | python file> [app arguments]
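
Concretely, every spark-submit option must come before the application file, and anything placed after the application file is passed through to your application as its own arguments. For example (main.py, arg1, and arg2 are placeholder names):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    main.py arg1 arg2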