pyspark count rows on condition
Based on @Psidom's answer, mine is as follows:
from pyspark.sql.functions import col, when, count

test.groupBy("x").agg(
    count(when(col("y") > 12453, True)),
    count(when(col("z") > 230, True))
).show()
count doesn't sum Trues; it only counts the number of non-null values. To count the True values, you need to convert the conditions to 1 / 0 and then sum:
import pyspark.sql.functions as F
cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
test.groupBy('x').agg(
    cnt_cond(F.col('y') > 12453).alias('y_cnt'),
    cnt_cond(F.col('z') > 230).alias('z_cnt')
).show()
+---+-----+-----+
| x|y_cnt|z_cnt|
+---+-----+-----+
| bn| 0| 0|
| mb| 2| 2|
+---+-----+-----+