Apache Spark: Get number of records per partition
I'd use the built-in function. It should be about as efficient as it gets:
import org.apache.spark.sql.functions.spark_partition_id
df.groupBy(spark_partition_id()).count()
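For example, here is a minimal end-to-end sketch (assuming a running SparkSession named spark; testDf is a made-up example DataFrame, not part of the original answer):

import org.apache.spark.sql.functions.spark_partition_id

// hypothetical example DataFrame split across 4 partitions
val testDf = spark.range(0, 1000000, 1, 4).toDF("id")

// count rows per physical partition, ordered by partition id for readability
testDf
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()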
Spark/Scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions)
val l = a.glom().map(_.length).collect()  // get length of each partition
println((l.min, l.max, l.sum / l.length, l.length))  // check if skewed
PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
The same is possible for a DataFrame, not just for an RDD. Just add DF.rdd.glom... into the code above.
Credits: Mike Dusenberry @ https://issues.apache.org/jira/browse/SPARK-17817
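For instance, a minimal sketch of the DataFrame variant (assuming a SparkSession named spark; df here is a hypothetical example DataFrame):

// hypothetical example DataFrame with 20 partitions
val df = spark.range(0, 1000000, 1, 20).toDF("id")

// glom on the underlying RDD yields one array of rows per partition
val l = df.rdd.glom().map(_.length).collect()
println((l.min, l.max, l.sum / l.length, l.length))  // check if skewed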
For future PySpark users:
from pyspark.sql.functions import spark_partition_id
rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()
You can get the number of records per partition like this:
df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")  // toDF needs import spark.implicits._ outside the shell
  .show()
But this will also launch a Spark job by itself (because the file must be read by Spark to get the number of records).
Spark may also be able to read Hive table statistics, but I don't know how to display that metadata.
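As a rough sketch of one way such statistics can be surfaced (assuming the data is registered as a table; my_table is a made-up name, and this gives table-level statistics rather than per-partition record counts):

// compute table-level statistics (row count, size in bytes) for a registered table
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")

// the collected statistics then show up in the extended table description
spark.sql("DESCRIBE EXTENDED my_table").show(100, truncate = false)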