How do I truncate a PySpark dataframe of timestamp type to the day?
You use wrong function. trunc
supports only a few formats:
Returns date truncated to the unit specified by the format.
:param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'
Use date_trunc
instead:
Returns timestamp truncated to the unit specified by the format.
:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
Example:
from pyspark.sql.functions import col, date_trunc
df = spark.createDataFrame(["2018-04-07 23:33:21"], "string").toDF("dt").select(col("dt").cast("timestamp"))
df.select(date_trunc("day", "dt")).show()
# +-------------------+
# |date_trunc(day, dt)|
# +-------------------+
# |2018-04-07 00:00:00|
# +-------------------+