In Apache Spark, why does RDD.union not preserve the partitioner?
union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions, then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs placed one after the other. This is just a bookkeeping change; there is no shuffle.
But it necessarily discards the partitioner. A partitioner is constructed for a given number of partitions, and the resulting RDD has a number of partitions that differs from that of both rdd1 and rdd2.
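The behavior described above can be checked with a small sketch. It assumes a local Spark installation (Spark 1.6 or later, for getNumPartitions); the object name, the sample data, and the choice of HashPartitioner are illustrative assumptions, not taken from the question.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and data are made up for illustration.
object PlainUnionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("plain-union-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Two pair RDDs with different partitioners (10 vs. 20 partitions).
      val rdd1 = sc.parallelize(1 to 100).map(i => (i, "a")).partitionBy(new HashPartitioner(10))
      val rdd2 = sc.parallelize(1 to 100).map(i => (i, "b")).partitionBy(new HashPartitioner(20))

      val unioned = rdd1.union(rdd2)

      println(unioned.getNumPartitions) // 30: the partitions are simply concatenated
      println(unioned.partitioner)      // None: neither input partitioner fits 30 partitions
    } finally {
      sc.stop()
    }
  }
}
```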
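After taking the union you can run partitionBy to shuffle the data and organize it by key. (repartition also shuffles, but it spreads records evenly across partitions rather than by key and does not set a partitioner.)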
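As a rough sketch of re-partitioning after a union, the example below compares repartition with partitionBy on a pair RDD. It assumes a local Spark installation (1.6 or later); the object name, the data, and the target of 8 partitions are hypothetical choices for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and data are made up for illustration.
object RepartitionAfterUnionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-after-union").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(1 to 100, 10).map(i => (i, "a"))
      val rdd2 = sc.parallelize(1 to 100, 20).map(i => (i, "b"))
      val unioned = rdd1.union(rdd2) // 30 partitions, no partitioner

      // repartition shuffles, but distributes records evenly rather than by key.
      val evenlySpread = unioned.repartition(8)
      println(evenlySpread.partitioner) // None

      // partitionBy shuffles by key and records the partitioner on the result.
      val byKey = unioned.partitionBy(new HashPartitioner(8))
      println(byKey.getNumPartitions)      // 8
      println(byKey.partitioner.isDefined) // true
    } finally {
      sc.stop()
    }
  }
}
```

Only the partitionBy result lets later key-based operations (such as reduceByKey with the same partitioner) skip another shuffle.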
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
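A minimal sketch of this exception, assuming a local Spark installation (1.6 or later). The object name, the data, and the shared HashPartitioner(8) are illustrative assumptions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and data are made up for illustration.
object PartitionerAwareUnionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioner-aware-union").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Both inputs share one partitioner, so they have the same number of partitions.
      val partitioner = new HashPartitioner(8)
      val rdd1 = sc.parallelize(1 to 100).map(i => (i, "a")).partitionBy(partitioner)
      val rdd2 = sc.parallelize(1 to 100).map(i => (i, "b")).partitionBy(partitioner)

      val unioned = rdd1.union(rdd2)

      println(unioned.getNumPartitions)                // 8: partitions are merged pairwise
      println(unioned.partitioner == Some(partitioner)) // true: the partitioner is retained
    } finally {
      sc.stop()
    }
  }
}
```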
This is no longer true. Iff two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have those same partitions. This was introduced in https://github.com/apache/spark/pull/4629 and incorporated into Spark 1.3.