Calculating percentage of total count for groupBy using pyspark

Here is an example as an alternative if you are not comfortable with windowing, which, as the comment alludes to, is the better way to go:

# Running in Databricks, where sc and sqlContext are already defined
from pyspark.sql import Row
import pyspark.sql.functions as F

data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]
rdd = sc.parallelize(data)

someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))

df = sqlContext.createDataFrame(someschema)

tot = df.count()  # total number of rows in the DataFrame

df.groupBy("c1") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
  .show()

returns:

+---+-------------+-------------------+
| c1|cnt_per_group|perc_of_count_total|
+---+-------------+-------------------+
|  E|            1| 16.666666666666664|
|  B|            1| 16.666666666666664|
|  D|            1| 16.666666666666664|
|  C|            1| 16.666666666666664|
|  A|            2|  33.33333333333333|
+---+-------------+-------------------+

I focus on Scala, and it seems easier there. That said, the solution suggested in the comments uses Window, which is what I would do in Scala with over(); a PySpark sketch of that approach is shown below.
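
For reference, here is a sketch of that Window-based approach, using the same df and column names as above; Window.partitionBy() with no columns creates a single window over all rows:

from pyspark.sql import Window

w = Window.partitionBy()  # empty partition spec -> one window spanning the whole DataFrame

df.groupBy("c1") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total',
              F.col('cnt_per_group') * 100 / F.sum('cnt_per_group').over(w)) \
  .show()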


You can use groupby and aggregate with agg. For example, for the following DataFrame:

+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       b|    2|
|       a|    3|
+--------+-----+
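
To create this DataFrame yourself, a minimal sketch (assuming the active SparkSession is named spark, as in the SQL example below):

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],
    ["category", "value"],
)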

You can use:

import pyspark.sql.functions as F

df.groupby('category').agg(
    F.count('value').alias('count'),
    (F.count('value') / df.count()).alias('percentage')
).show()

Output:

+--------+-----+------------------+
|category|count|        percentage|
+--------+-----+------------------+
|       b|    1|0.3333333333333333|
|       a|    2|0.6666666666666666|
+--------+-----+------------------+
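
If you want a rounded percentage instead of a fraction, a small variation of the same aggregation (this just multiplies by 100 and rounds to two decimals; adjust the formatting as needed):

df.groupby('category').agg(
    F.count('value').alias('count'),
    F.round(F.count('value') * 100 / F.lit(df.count()), 2).alias('percentage')
).show()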

Alternatively, you can use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT category,
           COUNT(*) AS count,
           COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
    FROM df
    GROUP BY category
    """
).show()
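
If you prefer to avoid the scalar subquery, the same ratio can also be computed with a window function directly in SQL; a sketch, where the grand total is taken with SUM(COUNT(*)) OVER ():

spark.sql(
    """
    SELECT category,
           COUNT(*) AS count,
           COUNT(*) / SUM(COUNT(*)) OVER () AS ratio
    FROM df
    GROUP BY category
    """
).show()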