Column alias after groupBy in PySpark
This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:
import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))  # rename the auto-generated "max(diff)" column
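For context, here is a self-contained sketch of why the alias is needed (the sample data is invented; joined_df and temp1 are the question's own objects):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the question's joined_df
df = spark.createDataFrame(
    [("2016-01-01", 5), ("2016-01-01", 9), ("2016-01-02", 3)],
    ["datestamp", "diff"],
)

grouped = df.groupBy("datestamp").max("diff")
grouped.printSchema()  # the aggregate column is auto-named "max(diff)"

# reference the generated name and alias it, as above
grouped.select("datestamp", func.col("max(diff)").alias("maxDiff")).show()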
In addition to the answers already here, the following are also convenient ways to rename the aggregated column when you know its name, without having to import from pyspark.sql.functions:
1.
grouped_df = joined_df.groupBy(temp1.datestamp) \
    .max('diff') \
    .selectExpr('max(diff) AS maxDiff')
See docs for info on .selectExpr()
2.
grouped_df = joined_df.groupBy(temp1.datestamp) \
    .max('diff') \
    .withColumnRenamed('max(diff)', 'maxDiff')
See docs for info on .withColumnRenamed()
This answer here goes into more detail: https://stackoverflow.com/a/34077809
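A third no-import variant worth knowing (a sketch, assuming the same joined_df and temp1 as above): DataFrame.toDF renames every column positionally, so you never have to spell out the auto-generated name:

grouped_df = joined_df.groupBy(temp1.datestamp) \
    .max('diff') \
    .toDF('datestamp', 'maxDiff')  # positional rename: grouping column first, then the aggregate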
You can use agg instead of calling the max method:
from pyspark.sql.functions import max  # note: shadows Python's built-in max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
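agg also accepts several aggregate expressions at once, each with its own alias. A sketch reusing the question's joined_df and temp1 (the min aggregation is just for illustration):

from pyspark.sql import functions as F

joined_df.groupBy(temp1.datestamp).agg(
    F.max("diff").alias("maxDiff"),
    F.min("diff").alias("minDiff"),
)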
Similarly, in Scala:
import org.apache.spark.sql.functions.max
joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
You can also alias several aggregated columns in one select. Scala-style as doesn't exist in Python, so use .alias(), and call .show() separately since it returns None:

from pyspark.sql.functions import col

grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"),
                          col("sum(DIFF)").alias("sumdiff"))
grouped_df.show()