PySpark Throwing error Method __getnewargs__([]) does not exist
Using `spark` inside `flatMap`, or inside any transformation that runs on the executors, is not allowed: the Spark session is only available on the driver. It is also not possible to create an RDD of RDDs (see: Is it possible to create nested RDDs in Apache Spark?)
But you can achieve this transformation another way: read the contents of all_files.txt into a DataFrame, use a local `map` to turn each filename into a DataFrame, and a local `reduce` to union them all. For example:
>>> from functools import reduce
>>> filenames = spark.read.text('all_files.txt').collect()
>>> dataframes = map(lambda r: spark.read.text(r[0]), filenames)
>>> all_lines_df = reduce(lambda df1, df2: df1.union(df2), dataframes)
I hit this problem today and finally figured out that I was referring to a Spark DataFrame object inside a `pandas_udf`, which caused this error.
The conclusion:
You can't use a `SparkSession` object, a DataFrame, or any other Spark distributed object inside a `udf` or `pandas_udf`, because they cannot be pickled and shipped to the executors.
If you hit this error and you are using a `udf`, check it carefully; it is most likely a related problem.