pyspark Column is not iterable
It's because you've overwritten the max definition provided by Apache Spark, so the call resolves to Python's built-in max. It was easy to spot because the built-in max expects an iterable, and a Spark Column is not one.
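For reference, the failure looks like this (a minimal sketch; the DataFrame and data are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (1, 20)], ["id", "cycle"])

# Only col is imported here, so max below is Python's built-in.
# It tries to iterate the Column and raises:
# TypeError: Column is not iterable
df.groupBy(col("id")).agg(max(col("cycle")))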
To fix this, use a syntax that doesn't depend on the shadowed name:
from pyspark.sql.functions import col

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})
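As a side note (not from the original answer): the dictionary form names the result column max(cycle), so you may want to rename it; a sketch:

linesWithSparkGDF = (linesWithSparkDF.groupBy(col("id"))
                     .agg({"cycle": "max"})
                     .withColumnRenamed("max(cycle)", "max_cycle"))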
Alternatively:
from pyspark.sql.functions import max as sparkMax
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
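Put together, a minimal runnable sketch of this approach (the sample data is invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as sparkMax

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for linesWithSparkDF
linesWithSparkDF = spark.createDataFrame(
    [(1, 10), (1, 30), (2, 20)], ["id", "cycle"])

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
linesWithSparkGDF.show()  # one row per id, with a max(cycle) column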
I know the question is old but this might help someone.
First, import the following:
from pyspark.sql import functions as F
Then
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")).agg(F.max(F.col("cycle")))
The idiomatic style for avoiding this problem, which stems from unfortunate namespace collisions between some Spark SQL function names and Python built-in function names, is to import the Spark SQL functions module like this:
from pyspark.sql import functions as F
# USAGE: F.col(), F.max(), F.someFunc(), ...
Then, using the OP's example, you'd simply apply F like this:
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                    .agg(F.max(F.col("cycle")))
In practice, this is how the problem is avoided idiomatically. =:)
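A quick sketch of why the namespaced import is safer (the data here is made up for illustration): the built-in max stays untouched, while Spark's aggregate max is always reached explicitly through F.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Python's built-in max still works on ordinary iterables:
assert max([1, 2, 3]) == 3

# Spark's aggregate max is reached explicitly via the F namespace:
df = spark.createDataFrame([(1, 10), (1, 30)], ["id", "cycle"])
df.groupBy(F.col("id")).agg(F.max("cycle").alias("max_cycle")).show()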