How to repartition a PySpark DataFrame?
Note that DataFrames are immutable: repartition() returns a new DataFrame rather than modifying the one it is called on, so you have to assign the result back:

print(df.rdd.getNumPartitions())
# 1
df.repartition(5)  # return value is discarded; df is unchanged
print(df.rdd.getNumPartitions())
# 1
df = df.repartition(5)  # assign the new, repartitioned DataFrame
print(df.rdd.getNumPartitions())
# 5
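As an aside, repartition() also accepts column arguments, so rows with the same key end up in the same partition. A minimal sketch (the "country" column here is hypothetical):

df = df.repartition(5, "country")  # 5 partitions, hash-partitioned on country
print(df.rdd.getNumPartitions())
# 5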
See Spark: The Definitive Guide, Chapter 5, "Basic Structured Operations" (ISBN-13: 978-1491912218, ISBN-10: 1491912219).
You can check the current number of partitions; since this is PySpark, use getNumPartitions() (rdd.partitions.size is the Scala API):

print(data.rdd.getNumPartitions())

To change the number of partitions:

newDF = data.repartition(3000)

Then check again:

print(newDF.rdd.getNumPartitions())
Beware: repartition() performs a full shuffle of the data, which is expensive. If you only need to reduce the number of partitions, take a look at coalesce(), which avoids a full shuffle by merging existing partitions.
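For illustration, a minimal sketch of coalesce(), picking up the newDF from the snippet above (the partition counts are hypothetical):

print(newDF.rdd.getNumPartitions())
# 3000
smallerDF = newDF.coalesce(10)  # merges existing partitions; no full shuffle
print(smallerDF.rdd.getNumPartitions())
# 10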