How to convert date to the first day of month in a PySpark Dataframe column?
You can use trunc
:
import pyspark.sql.functions as f
df.withColumn("first_date", f.trunc("date", "month")).show()
+----------+----------+
| date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+
You can get the beginning of the month with the trunc
function (as Alper) mentioned or with the date_trunc
method. The trunc
function returns a date column and the date_trunc
function returns a time column. Suppose you have the following DataFrame:
+----------+
| some_date|
+----------+
|2017-11-25|
|2017-12-21|
|2017-09-12|
| null|
+----------+
Run the trunc
and date_trunc
functions:
datesDF\
.withColumn("beginning_of_month_date", trunc(col("some_date"), "month"))\
.withColumn("beginning_of_month_time", date_trunc("month" ,col("some_date")))\
.show()
Observe the result:
+----------+-----------------------+-----------------------+
| some_date|beginning_of_month_date|beginning_of_month_time|
+----------+-----------------------+-----------------------+
|2017-11-25| 2017-11-01| 2017-11-01 00:00:00|
|2017-12-21| 2017-12-01| 2017-12-01 00:00:00|
|2017-09-12| 2017-09-01| 2017-09-01 00:00:00|
| null| null| null|
+----------+-----------------------+-----------------------+
Print the schema to confirm the column types:
root
|-- some_date: date (nullable = true)
|-- beginning_of_month_date: date (nullable = true)
|-- beginning_of_month_time: timestamp (nullable = true)
Scala users should use the beginningOfMonthDate
and beginningOfMonthTime
functions defined in spark-daria.
PySpark users should use the beginning_of_month_date
and beginning_of_month_time
functions defined in quinn.
Notice how the trunc
function takes the column argument first and the date_trunc
takes the column argument second. The trunc
method is poorly named - it's part of the functions package, so it's easy to mistakenly think this function is for string truncation. It's surprising that date_trunc
is returning a timestamp result... sounds like it should return a date result.
Just make sure to wrap these functions with descriptive function / UDF names so your code is readable. See here for more info.