Is a Spark RDD deterministic for the set of elements in each partition?
Let's look at the source of RDD.coalesce (which repartition calls with shuffle = true), and specifically its shuffle branch:
...
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      // Note that the hash code of the key will just be the key itself. The HashPartitioner
      // will mod it with the number of total partitions.
      position = position + 1
      (position, t)
    }
  } : Iterator[(Int, T)]
...
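To see why the starting point is fixed, here is a minimal standalone sketch (plain Scala, no Spark) of the same seeding logic; startPosition is a hypothetical helper name, not part of the Spark API:

import scala.util.Random
import scala.util.hashing

object StartPositionDemo {
  // Same seeding as in the excerpt: the Random is seeded purely from the
  // source partition index, so the start position is a pure function of it.
  def startPosition(index: Int, numPartitions: Int): Int =
    new Random(hashing.byteswap32(index)).nextInt(numPartitions)

  def main(args: Array[String]): Unit = {
    // Two calls with the same index always agree.
    println(startPosition(0, 4) == startPosition(0, 4)) // true
    // Each source partition gets its own fixed starting offset.
    println((0 until 4).map(startPosition(_, 4)))
  }
}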
As the excerpt shows, the distribution of elements from a given source partition N into X target partitions is a simple increment (later taken modulo X by the HashPartitioner), starting from a position that depends only on N itself. There is no run-time randomness here: the Random instance is seeded deterministically with byteswap32(N). So as long as your source RDD is unchanged (same contents and same element order within each partition), the result of repartition(X) should be the same every time as well.
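A quick way to check this empirically (a sketch assuming a local Spark session; the object and app names are illustrative):

import org.apache.spark.sql.SparkSession

object RepartitionDeterminismCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repartition-check").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

    // glom() materializes each partition as an array, so we can compare the
    // per-partition contents of two independent repartition runs.
    val run1 = rdd.repartition(4).glom().collect().map(_.toSet)
    val run2 = rdd.repartition(4).glom().collect().map(_.toSet)

    // Expected to print true as long as the source RDD is unchanged.
    println(run1.sameElements(run2))
    spark.stop()
  }
}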