PySpark 1.5 How to Truncate Timestamp to Nearest Minute from seconds
This question was asked a few years ago, but if anyone else comes across it: as of Spark v2.3 this has been added as a feature. Now it is as simple as the following (this assumes canon_evt is a DataFrame with a timestamp column dt that we want to remove the seconds from):
from pyspark.sql.functions import date_trunc
canon_evt = canon_evt.withColumn('dt', date_trunc('minute', canon_evt.dt))
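For reference, date_trunc accepts other format strings as well ('hour', 'day', 'month', and so on), so the same call truncates to coarser units. A minimal sketch using a throwaway one-row DataFrame (the spark session and the demo column name are just placeholders for illustration):

from pyspark.sql.functions import col, date_trunc

# Hypothetical one-row DataFrame, only to show the different format strings.
demo = spark.createDataFrame(
    [("2015-09-16 05:39:46",)], ["dt"]
).withColumn("dt", col("dt").cast("timestamp"))

demo.select(
    date_trunc("minute", col("dt")).alias("to_minute"),  # 2015-09-16 05:39:00
    date_trunc("hour", col("dt")).alias("to_hour"),      # 2015-09-16 05:00:00
    date_trunc("day", col("dt")).alias("to_day"),        # 2015-09-16 00:00:00
).show(truncate=False)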
Spark >= 2.3
You can use date_trunc
from pyspark.sql.functions import col, date_trunc

# df here is the example DataFrame built in the Spark < 2.3 section below.
df.withColumn("dt_truncated", date_trunc("minute", col("dt"))).show()
## +-------------------+-------------------+
## | dt| dt_truncated|
## +-------------------+-------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00|
## |2015-09-16 05:39:46|2015-09-16 05:39:00|
## |2015-09-16 05:40:46|2015-09-16 05:40:00|
## |2016-03-05 02:00:10|2016-03-05 02:00:00|
## +-------------------+-------------------+
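The same function is also available as a SQL expression, so you can reach it through expr or spark.sql if that fits your pipeline better. A quick sketch (the temp view name "events" is just for illustration):

from pyspark.sql.functions import expr

df.withColumn("dt_truncated", expr("date_trunc('minute', dt)")).show()

# Or via the SQL interface:
df.createOrReplaceTempView("events")
spark.sql("SELECT dt, date_trunc('minute', dt) AS dt_truncated FROM events").show()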
Spark < 2.3
Converting to Unix timestamps and basic arithmetic should do the trick:
from pyspark.sql import Row
from pyspark.sql.functions import col, unix_timestamp, round
df = sc.parallelize([
Row(dt='1970-01-01 00:00:00'),
Row(dt='2015-09-16 05:39:46'),
Row(dt='2015-09-16 05:40:46'),
Row(dt='2016-03-05 02:00:10'),
]).toDF()
## unix_timestamp converts string to Unix timestamp (bigint / long)
## in seconds. Divide by 60, round, multiply by 60 and cast
## should work just fine.
##
dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
.cast("timestamp"))
df.withColumn("dt_truncated", dt_truncated).show(10, False)
## +-------------------+---------------------+
## |dt |dt_truncated |
## +-------------------+---------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
## +-------------------+---------------------+
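Note that round snaps to the nearest minute, which is why 05:39:46 becomes 05:40:00 above. If you want to truncate down instead (matching what date_trunc does), swap round for floor; a small sketch under the same setup:

from pyspark.sql.functions import col, floor, unix_timestamp

# floor instead of round: always drop the seconds rather than rounding them.
dt_floored = (floor(unix_timestamp(col("dt")) / 60) * 60).cast("timestamp")

df.withColumn("dt_truncated", dt_floored).show(10, False)

## With this variant 05:39:46 -> 05:39:00 and 05:40:46 -> 05:40:00.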