Get all the dates between two dates in a Spark DataFrame
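For reference, the answers below assume an input DataFrame along these lines; the column names and the sample row are taken from the outputs shown further down, the construction itself is just a sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample input, reconstructed from the columns and rows shown in the answers below
df = spark.createDataFrame(
    [(1000, '2018-01-01', '2018-01-05')],
    ['vyge_id', 'bookingDt', 'arrivalDt'],
)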
Well, you can do the following.
Create a dataframe with dates only:
dates_df  # with all days between the first bookingDt and the last arrivalDt
and then join that dataframe to df with a between condition:
from pyspark.sql.functions import col

df.join(dates_df,
        on=col('dates_df.dates').between(col('df.bookingDt'), col('df.arrivalDt')))\
  .select('df.*', 'dates_df.dates')
It might even work faster than the solution with explode; however, you need to figure out the start and end dates for this dataframe.
A dates dataframe covering 10 years will have only about 3,650 records, not that many to worry about.
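The answer leaves building dates_df open; here is a minimal sketch, assuming the range can be taken from df itself, that spark and df are the session and input frame from above, and that bookingDt and arrivalDt are dates (or ISO date strings). The names bounds, start, end and result are mine:
import datetime
from pyspark.sql import functions as F

# Range taken from df itself
bounds = df.select(F.min('bookingDt').alias('start'), F.max('arrivalDt').alias('end')).first()
start = datetime.date.fromisoformat(str(bounds['start']))
end = datetime.date.fromisoformat(str(bounds['end']))

# One row per calendar day between start and end, inclusive
dates_df = spark.createDataFrame(
    [(start + datetime.timedelta(days=i),) for i in range((end - start).days + 1)],
    ['dates'],
)

# The join above refers to 'df' and 'dates_df' by name, so alias both frames first
result = (df.alias('df')
          .join(dates_df.alias('dates_df'),
                on=F.col('dates_df.dates').between(F.col('df.bookingDt'), F.col('df.arrivalDt')))
          .select('df.*', 'dates_df.dates'))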
As long as you're using Spark version 2.1 or higher, you can exploit the fact that column values can be used as arguments with pyspark.sql.functions.expr():
- Create a dummy string of repeating commas with a length equal to diffDays (the setup sketch after this list shows how a diffDays column might be computed)
- Split this string on ',' to turn it into an array of size diffDays
- Use pyspark.sql.functions.posexplode() to explode this array along with its indices
- Finally use pyspark.sql.functions.date_add() to add the index value number of days to the bookingDt
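The snippet under "Code:" below starts from a diffDaysDF that already carries a diffDays column; presumably it is produced with datediff, roughly like this (the construction is an assumption, only the name diffDaysDF comes from the answer):
import pyspark.sql.functions as f

# Assumed setup: diffDays = number of whole days between bookingDt and arrivalDt per row
diffDaysDF = df.withColumn('diffDays', f.datediff('arrivalDt', 'bookingDt'))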
Code:
import pyspark.sql.functions as f
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
.select("*", f.posexplode("repeat").alias("txnDt", "val"))\
.drop("repeat", "val", "diffDays")\
.withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
.show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id| txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01| 1000|2018-01-01|
#|2018-01-05|2018-01-01| 1000|2018-01-02|
#|2018-01-05|2018-01-01| 1000|2018-01-03|
#|2018-01-05|2018-01-01| 1000|2018-01-04|
#|2018-01-05|2018-01-01| 1000|2018-01-05|
#+----------+----------+-------+----------+
For Spark 2.4+, sequence can be used to create an array containing all dates between bookingDt and arrivalDt. This array can then be exploded.
from pyspark.sql import functions as F
df = df \
.withColumn('bookingDt', F.col('bookingDt').cast('date')) \
.withColumn('arrivalDt', F.col('arrivalDt').cast('date'))
df.withColumn('txnDt', F.explode(F.expr('sequence(bookingDt, arrivalDt, interval 1 day)')))\
.show()
Output:
+-------+----------+----------+----------+
|vyge_id| bookingDt| arrivalDt| txnDt|
+-------+----------+----------+----------+
| 1000|2018-01-01|2018-01-05|2018-01-01|
| 1000|2018-01-01|2018-01-05|2018-01-02|
| 1000|2018-01-01|2018-01-05|2018-01-03|
| 1000|2018-01-01|2018-01-05|2018-01-04|
| 1000|2018-01-01|2018-01-05|2018-01-05|
+-------+----------+----------+----------+
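For what it's worth, the same sequence-plus-explode step can also be written as a single selectExpr; this is just a stylistic variant of the snippet above and assumes the date casts have already been applied:
df.selectExpr(
    'vyge_id',
    'bookingDt',
    'arrivalDt',
    'explode(sequence(bookingDt, arrivalDt, interval 1 day)) AS txnDt',
).show()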