How to subtract a column of days from a column of dates in Pyspark?
I was able to solve this using selectExpr:
df.selectExpr('date_sub(date_col, day_col) as subtracted_dates')
If you want to append the column to the original DF, just add * to the expression:
df.selectExpr('*', 'date_sub(date_col, day_col) as subtracted_dates')
Use the expr function if you have dynamic values from columns to subtract:
>>> from pyspark.sql.functions import expr
>>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))
Use the withColumn function with date_sub if you have a literal value to subtract:
>>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))
Not the most elegant solution ever, but if you don't want to hack SQL expressions in Scala (not that it should be hard, but these are private to sql), something like this should do the trick:
from pyspark.sql import Column

def date_sub_(c1: Column, c2: Column) -> Column:
    # Convert the date to seconds since the epoch, subtract the day
    # count expressed in seconds, then cast back through timestamp to date
    return ((c1.cast("timestamp").cast("long") - 60 * 60 * 24 * c2)
            .cast("timestamp").cast("date"))
For Python 2.x just drop type annotations.