Date difference between consecutive rows - Pyspark Dataframe
Another way could be:
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
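# Seconds since the same user's previous row; null for each user's first row.
# Assumes date is a timestamp column, so cast("bigint") yields epoch seconds.
w = Window.partitionBy("user_id").orderBy("date")
df.withColumn(
    "time_intertweet",
    df.date.cast("bigint") - lag(df.date.cast("bigint"), 1).over(w),
)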
EDITED thanks to @cool_kid
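To make it clearer what the window expression computes, here is a minimal plain-Python sketch of the same logic (no Spark involved). The sample rows and the helper name time_intertweet are hypothetical, and it assumes date holds timestamps: rows are grouped by user_id, ordered by date, and each row gets the gap in seconds to that user's previous row, with None for the first row of each user, where lag returns null.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (user_id, date)
rows = [
    ("a", datetime(2020, 1, 1, 12, 0, 0)),
    ("b", datetime(2020, 1, 1, 9, 0, 0)),
    ("a", datetime(2020, 1, 1, 12, 0, 30)),
    ("a", datetime(2020, 1, 2, 12, 0, 30)),
]

def time_intertweet(rows):
    """Mimic lag(date, 1) OVER (PARTITION BY user_id ORDER BY date)."""
    out = []
    # Sorting by (user_id, date) puts each user's rows together, in date order.
    for user_id, group in groupby(sorted(rows), key=itemgetter(0)):
        previous = None  # no previous row at the start of a partition
        for _, date in group:
            gap = None if previous is None else int((date - previous).total_seconds())
            out.append((user_id, date, gap))
            previous = date
    return out

result = time_intertweet(rows)
for row in result:
    print(row)
```

The first row of every user has no predecessor, so its gap is empty; filter or fill those rows if you need a number everywhere.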
@Joesemy's answer is really good, but it didn't work for me because cast("bigint") threw an error. So I used the datediff function from the pyspark.sql.functions module instead, and it worked. Note that datediff returns the difference in whole days, not seconds:
from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("date")
# Days since the same user's previous row; null for each user's first row.
df.withColumn("time_intertweet", datediff(df.date, lag(df.date, 1).over(w)))
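One thing to keep in mind when choosing between the two approaches is the unit of the result. The plain-Python sketch below (hypothetical timestamps, no Spark involved) shows the difference, assuming date is a timestamp: subtracting epoch seconds measures elapsed time, while datediff compares only the calendar dates and ignores the time of day.

```python
from datetime import datetime

# Hypothetical pair of consecutive tweets from one user
previous = datetime(2020, 1, 1, 23, 0, 0)
current = datetime(2020, 1, 2, 1, 0, 0)

# What the epoch-seconds subtraction measures: elapsed seconds
seconds_gap = int((current - previous).total_seconds())

# What datediff measures: difference between the calendar dates only
days_gap = (current.date() - previous.date()).days

print(seconds_gap, days_gap)
```

Here the two tweets are only two hours apart, yet they fall on different calendar days, so the day-based difference is 1.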
You can also do it in Spark SQL, like this:
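# registerTempTable is deprecated since Spark 2.0; use df.createOrReplaceTempView("df") there.
df.registerTempTable("df")
sqlContext.sql("""
    SELECT *,
           CAST(date AS bigint) - CAST(lag(date, 1) OVER (
               PARTITION BY user_id ORDER BY date) AS bigint) AS time_intertweet
    FROM df""")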