I can't seem to get --py-files on Spark to work

First off, I'll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Above, the cd dependencies command is crucial to ensure that the modules are the in the top level of the zip file. Thanks to Dan Corin's post for heads up.

Next, submit the job via:

spark-submit --py-files dependencies.zip spark_job.py

The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (source of confusion for me). To add the dependencies to the PYTHONPATH to fix the ImportError, add the following line to the Spark job, spark_job.py:

sc.addPyFile("dependencies.zip")

A caveat from this Cloudera post:

An assumption that anyone doing distributed computing with commodity hardware must assume is that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client’s CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually.

Although the solution above does not build an egg, the same guideline applies.

First you need to pass your files through --py-files or --files
- When you pass your zip/files with the above flags, basically your resources will be transferred to temporary directory created on HDFS just for the lifetime of that application.
Now in your code, add those zip/files by using the following command

sc.addPyFile("your zip/file")
- what the above does is, it loads the files to the execution environment, like JVM.
Now import your zip/file in your code with an alias like the following to start referencing it

import zip/file as your-alias

Note: You need not use file extension while importing, like .py at the end

Hope this is useful.

To get this dependency distribution approach to work with compiled extensions we need to do two things:

Run the pip install on the same OS as your target cluster (preferably on the master node of the cluster). This ensures compatible binaries are included in your zip.
Unzip your archive on the destination node. This is necessary since Python will not import compiled extensions from zip files. (https://docs.python.org/3.8/library/zipimport.html)

Using the following script to create your dependencies zip will ensure that you are isolated from any packages already installed on your system. This assumes virtualenv is installed and requirements.txt is present in your current directory, and outputs a dependencies.zip with all your dependencies at the root level.

env_name=temp_env

# create the virtual env
virtualenv --python=$(which python3) --clear /tmp/${env_name}

# activate the virtual env
source /tmp/${env_name}/bin/activate

# download and install dependencies
pip install -r requirements.txt

# package the dependencies in dependencies.zip. the cd magic works around the fact that you can't specify a base dir to zip
(cd /tmp/${env_name}/lib/python*/site-packages/ && zip -r - *) > dependencies.zip

The dependencies can now be deployed, unzipped, and included in the PYTHONPATH as so

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf 'spark.yarn.dist.archives=dependencies.zip#deps' \
  --conf 'spark.yarn.appMasterEnv.PYTHONPATH=deps' \
  --conf 'spark.executorEnv.PYTHONPATH=deps' \
.
.
.

spark.yarn.dist.archives=dependencies.zip#deps
distributes your zip file and unzips it to a directory called deps

spark.yarn.appMasterEnv.PYTHONPATH=deps
spark.executorEnv.PYTHONPATH=deps
includes the deps directory in the PYTHONPATH for the master and all workers

--deploy-mode cluster
runs the master executor on the cluster so it picks up the dependencies

I can't seem to get --py-files on Spark to work

Tags:

Python

Apache Spark

Pyspark

Related

Recent Posts