Calculating percentage of total count for groupBy using pyspark
An example as an alternative if not comfortable with Windowing as the comment alludes to and is the better way to go:
# Running in Databricks, not all stuff required
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
#from pyspark.sql.functions import col
data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
df = sqlContext.createDataFrame(someschema)
tot = df.count()
df.groupBy("c1") \
.count() \
.withColumnRenamed('count', 'cnt_per_group') \
.withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
.show()
returns:
+---+-------------+-------------------+
| c1|cnt_per_group|perc_of_count_total|
+---+-------------+-------------------+
| E| 1| 16.666666666666664|
| B| 1| 16.666666666666664|
| D| 1| 16.666666666666664|
| C| 1| 16.666666666666664|
| A| 2| 33.33333333333333|
+---+-------------+-------------------+
I focus on Scala and it seems easier with that. That said, the suggested solution via the comments uses Window which is what I would do in Scala with over().
You can groupby
and aggregate with agg
. For example, for the following DataFrame:
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
You can use:
import pyspark.sql.functions as F
df.groupby('category').agg(
(F.count('value')).alias('count'),
(F.count('value') / df.count()).alias('percentage')
).show()
Output:
+--------+-----+------------------+
|category|count| percentage|
+--------+-----+------------------+
| b| 1|0.3333333333333333|
| a| 2|0.6666666666666666|
+--------+-----+------------------+
Alternatively, you can use SQL:
df.createOrReplaceTempView('df')
spark.sql(
"""
SELECT category,
COUNT(*) AS count,
COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
FROM df
GROUP BY category
"""
).show()