Group Spark DataFrame by date
Since 1.5.0, Spark provides a number of functions like dayofmonth, hour, month or year which can operate on dates and timestamps. So if timestamp is a TimestampType, all you need is a correct expression. For example:
from pyspark.sql.functions import hour, mean

(df
    .groupBy(hour("timestamp").alias("hour"))
    .agg(mean("value").alias("mean"))
    .show())
## +----+------------------+
## |hour| mean|
## +----+------------------+
## | 0|508.05999999999995|
## | 1| 449.8666666666666|
## | 2| 524.9499999999999|
## | 3|264.59999999999997|
## +----+------------------+
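The same pattern works with the other functions mentioned above; a quick sketch, assuming the same df and column names, grouping by year and month:

from pyspark.sql.functions import year, month, mean

# Group by two calendar components at once (year and month of the timestamp)
(df
    .groupBy(year("timestamp").alias("year"), month("timestamp").alias("month"))
    .agg(mean("value").alias("mean"))
    .show())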
Pre-1.5.0, your best option is to use HiveContext and Hive UDFs, either with selectExpr:
df.selectExpr("year(timestamp) AS year", "value").groupBy("year").sum()
## +----+---------+----------+
## |year|SUM(year)|SUM(value)|
## +----+---------+----------+
## |2015| 40300| 9183.0|
## +----+---------+----------+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("""
    SELECT MONTH(timestamp) AS month, SUM(value) AS values_sum
    FROM df
    GROUP BY MONTH(timestamp)""")
Just remember that the aggregation is performed by Spark, not pushed down to the external source. Usually this is the desired behavior, but there are situations where you may prefer to perform the aggregation as a subquery to limit data transfer.
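For a JDBC source, one way to do that is to pass the aggregating subquery as the table definition so the database does the work; a minimal sketch, where the connection URL, table (events) and column (ts, value) names are hypothetical:

# Minimal sketch: let the database aggregate by registering a subquery as the
# JDBC "table". URL, table and column names below are placeholders.
aggregated = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://dbhost:5432/dbname",
    dbtable="""(SELECT EXTRACT(YEAR FROM ts) AS year, SUM(value) AS value_sum
                FROM events
                GROUP BY EXTRACT(YEAR FROM ts)) AS tmp"""
).load()

aggregated.show()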
Also, you can use date_format to create any time period you wish. Group by a specific day:
from pyspark.sql import functions as F
df.select(F.date_format('timestamp','yyyy-MM-dd').alias('day')).groupby('day').count().show()
Group by a specific month (just change the format):
df.select(F.date_format('timestamp','yyyy-MM').alias('month')).groupby('month').count().show()
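The same trick extends to any other granularity date_format can express; for example, a sketch counting rows per calendar hour (date plus hour, as opposed to the hour-of-day grouping at the top):

df.select(F.date_format('timestamp', 'yyyy-MM-dd HH').alias('hour')).groupby('hour').count().show()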