Python Spark Cumulative Sum by Group Using DataFrame
To update the previous answers: the correct and precise way to do this is with a row-based window frame:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Frame runs from the start of the partition up to and including the current row
windowval = (Window.partitionBy('class').orderBy('time')
             .rowsBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
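The snippets here assume an existing DataFrame named df with time, value, and class columns. As a minimal sketch (assuming an active SparkSession), the sample data used in the output further down can be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the output table below: (time, value, class)
df = spark.createDataFrame(
    [(1, 3, 'b'), (2, 3, 'b'), (1, 2, 'a'), (2, 2, 'a'), (3, 2, 'a')],
    ['time', 'value', 'class'])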
Alternatively, this can be done by combining a window function with the Window.unboundedPreceding value in the window's range, as follows:
from pyspark.sql import Window
from pyspark.sql import functions as F

# rangeBetween frames rows by the value of the ordering column, so rows that
# share the same 'time' within a class are summed together as peers
windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
| 1| 3| b| 3|
| 2| 3| b| 6|
| 1| 2| a| 2|
| 2| 2| a| 4|
| 3| 2| a| 6|
+----+-----+-----+-------+
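Both frames produce the same result here because the time values within each class are unique. They diverge when the ordering column has ties; a minimal sketch with made-up data (reusing the imports and SparkSession from above) illustrates the difference:

# Hypothetical data: two rows in class 'a' share time == 1
df_ties = spark.createDataFrame(
    [(1, 2, 'a'), (1, 3, 'a'), (2, 4, 'a')],
    ['time', 'value', 'class'])

rows_frame = (Window.partitionBy('class').orderBy('time')
              .rowsBetween(Window.unboundedPreceding, 0))
range_frame = (Window.partitionBy('class').orderBy('time')
               .rangeBetween(Window.unboundedPreceding, 0))

(df_ties
 .withColumn('rows_cum_sum', F.sum('value').over(rows_frame))
 .withColumn('range_cum_sum', F.sum('value').over(range_frame))
 .show())

# rows_cum_sum advances one row at a time (the relative order of the two tied
# rows is not guaranteed), while range_cum_sum gives both time == 1 rows the
# same total of 5, because rangeBetween treats rows with equal 'time' as peers.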