pyspark addPyFile to add zip of .py files, but module still not found
Fixed the problem. Admittedly, the solution is not entirely Spark-related, but I am leaving the question posted for the sake of others who may have a similar problem, since the error message did not make my mistake clear from the start.
TLDR: Make sure the contents of the zip file being loaded are structured and named the way your code expects, which means each package directory should contain an __init__.py.
The package I was trying to load into the spark context via zip was of the form
mypkg
    file1.py
    file2.py
    subpkg1
        file11.py
    subpkg2
        file21.py
My zip, when inspected with less mypkg.zip, showed:
file1.py file2.py subpkg1 subpkg2
So two things were wrong here.
- I was not zipping the top-level directory, which is the main package the code expected to import.
- I was not zipping the lower-level directories (the subpackages), since the archive was not built recursively.
Solved with
zip -r mypkg.zip mypkg
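As a quick sanity check (a sketch, not part of the original workflow), you can confirm from Python that every entry in the archive is prefixed with the top-level package directory before handing it to addPyFile:

import zipfile

# Every entry should start with "mypkg/", e.g. "mypkg/__init__.py",
# "mypkg/subpkg1/file11.py", and so on.
with zipfile.ZipFile("mypkg.zip") as zf:
    names = zf.namelist()
    print("\n".join(names))
    assert all(name.startswith("mypkg/") for name in names)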
More specifically, I had to make two zip files:
for the dist-keras package:
cd dist-keras; zip -r distkeras.zip distkeras
see https://github.com/cerndb/dist-keras/tree/master/distkeras
for the keras package used by distkeras (which is not installed across the cluster):
cd keras; zip -r keras.zip keras
see https://github.com/keras-team/keras/tree/master/keras
So declaring the Spark session looked like:
from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master)  # e.g. master = 'yarn-client'
conf.set("spark.executor.cores", str(num_cores))
conf.set("spark.executor.instances", str(num_executors))
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

# Check if the user is running Spark 2.0+
if using_spark_2:
    from pyspark.sql import SparkSession
    # Note: sc is a SparkSession here, not a SparkContext
    sc = SparkSession.builder \
        .config(conf=conf) \
        .appName(application_name) \
        .getOrCreate()
    sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/dist-keras/distkeras.zip")
    sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/keras/keras.zip")
    print(sc.version)
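A quick way to confirm that the executors can actually see the shipped code is to run the imports inside a task. This is a hedged sketch (not from the original post), and it assumes keras's backend dependencies are already available on the workers:

# Sanity check: try the imports on the executors, not just on the driver.
def try_imports(_):
    import distkeras  # shipped via distkeras.zip
    import keras      # shipped via keras.zip
    return distkeras.__file__, keras.__file__

print(sc.sparkContext.parallelize(range(2), 2).map(try_imports).collect())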
If your module is structured as below:
myModule
    __init__.py
    spark1.py
    spark2.py
Don't go inside the myModule folder and zip its contents; that produces the error you mentioned.
Instead, go to the directory above myModule, right-click the myModule folder itself and add it to a zip (you can give the archive another name).
The idea is that when Spark extracts your zip, a myModule folder should exist with the same name and hierarchy.
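If you want to script that step instead of using the right-click menu, one hedged option (a sketch; the paths are placeholders) is Python's shutil.make_archive, run from the parent directory, which keeps the myModule/ prefix inside the archive:

import shutil

# Run from the directory that contains myModule/.
# Produces myModule.zip whose entries are prefixed with "myModule/",
# which is the layout addPyFile and the import system expect.
shutil.make_archive("myModule", "zip", root_dir=".", base_dir="myModule")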