Functions from custom module not working in PySpark, but they work when inputted in interactive mode
I had the same error and followed the stack trace. In my case, I was building an Egg file and then passing it to Spark via the --py-files option.
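For reference, a typical submission with a packaged Egg looks something like this (the file names here are placeholders, not from my actual setup):

```shell
# Ship the packaged module to the executors alongside the driver script.
spark-submit --py-files mymodule.egg driver.py
```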
Concerning the error, I think it boils down to the fact that when you call F.udf(str2num, t.IntegerType()), a UserDefinedFunction instance is created before Spark is running, so it holds an empty reference to the SparkContext, call it sc. When the UDF runs, sc._pickled_broadcast_vars is dereferenced, and that throws the AttributeError in your output.
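The failure mode can be illustrated with a plain-Python analogy (no Spark involved; every name below is invented for illustration): a wrapper object captures whatever global context exists at creation time, and only dereferences it when called.

```python
_context = None  # stands in for the not-yet-started SparkContext


class Wrapper:
    """Analogy for UserDefinedFunction: binds the context at creation time."""

    def __init__(self, fn):
        self.fn = fn
        self.ctx = _context  # captured NOW -- may still be None

    def __call__(self, x):
        # mimics sc._pickled_broadcast_vars being dereferenced at call time
        self.ctx.register(x)
        return self.fn(x)


early = Wrapper(str.upper)  # created before the "context" exists


class Ctx:
    def register(self, x):
        pass


_context = Ctx()            # the "context" starts up
late = Wrapper(str.upper)   # created after the context exists

late("ok")    # works, returns "OK"
# early("ok") would raise AttributeError: 'NoneType' object has no attribute 'register'
```

The fix is the same in both worlds: delay constructing the wrapper until the context it captures actually exists.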
My workaround is to avoid creating the UDF until Spark is running (and hence there is an active SparkContext). In your case, you could just change your definition of letConvNum to:
def letConvNum(df):  # df is a PySpark DataFrame
    # Get a list of columns that I want to transform, using the metadata Pandas DataFrame
    chng_cols = metadta[(metadta.comments == 'letter conversion to num')].col_name.tolist()
    str2numUDF = F.udf(str2num, t.IntegerType())  # create the UDF on demand
    for curcol in chng_cols:
        df = df.withColumn(curcol, str2numUDF(df[curcol]))
    return df
Note: I haven't actually tested the code above, but the change in my own code was similar and everything worked fine.
Also, for the interested reader, see the Spark source code for UserDefinedFunction.