Does a flatMap in Spark cause a shuffle?
There is no shuffling with either `map` or `flatMap`. The operations that cause a shuffle are:
- Repartition operations: `repartition`, `coalesce`
- ByKey operations (except for counting): `groupByKey`, `reduceByKey`
- Join operations: `cogroup`, `join`
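The distinction above can be sketched at the collections level, without Spark at all. In this sketch partitions are modeled as `Seq[Seq[_]]` and the function names are illustrative, not Spark's API: a narrow operation like `flatMap` transforms each partition independently, while a wide operation like `groupByKey` must first move records between partitions.

```scala
// Narrow operation: each output partition depends on exactly one input
// partition, so every partition is transformed independently.
def flatMapNoShuffle[A, B](partitions: Seq[Seq[A]])(f: A => Seq[B]): Seq[Seq[B]] =
  partitions.map(_.flatMap(f))

// Wide operation: all values for a key may live in different partitions,
// so records must first be redistributed by key -- the "shuffle".
def groupByKeyWithShuffle[K, V](partitions: Seq[Seq[(K, V)]],
                                numPartitions: Int): Seq[Seq[(K, Seq[V])]] = {
  val all = partitions.flatten // every record may cross a partition boundary
  val grouped = all.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // Place each key in a target partition by hash, as a hash partitioner would.
  (0 until numPartitions).map { p =>
    grouped.toSeq.filter { case (k, _) => math.abs(k.hashCode) % numPartitions == p }
  }
}
```

Note how `flatMapNoShuffle` never looks outside a single partition, whereas `groupByKeyWithShuffle` starts by flattening everything together.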
Although the set of elements in each partition of newly shuffled data is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data following a shuffle, you can use:
- `mapPartitions` to sort each partition using, for example, `.sorted`
- `repartitionAndSortWithinPartitions` to efficiently sort partitions while simultaneously repartitioning
- `sortBy` to make a globally ordered RDD
More info here: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations
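Two of these options can be sketched at the collections level, without a Spark environment. Partitions are again modeled as `Seq[Seq[Int]]`, and the function names are illustrative, not Spark's API:

```scala
// mapPartitions-style sorting: each partition's iterator is sorted
// independently; no data crosses partition boundaries.
def sortWithinPartitions(partitions: Seq[Seq[Int]]): Seq[Seq[Int]] =
  partitions.map(p => p.iterator.toSeq.sorted)

// sortBy-style global ordering: conceptually a redistribution by key range
// followed by a per-partition sort, yielding a totally ordered dataset.
def sortGlobally(partitions: Seq[Seq[Int]], numPartitions: Int): Seq[Seq[Int]] = {
  val sorted = partitions.flatten.sorted
  val chunk = math.max(1, math.ceil(sorted.size.toDouble / numPartitions).toInt)
  sorted.grouped(chunk).toSeq.padTo(numPartitions, Seq.empty[Int])
}
```

The first keeps whatever partitioning already exists; the second pays for a full redistribution to get a global order, which mirrors the cost difference between `mapPartitions` and `sortBy` in Spark.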
No shuffling. Here are the sources for both functions:
```scala
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
```
As you can see, `RDD.flatMap` just calls `flatMap` on the Scala iterator that represents a partition, so no data is moved between partitions.
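You can see that same `Iterator.flatMap` behavior in plain Scala, with no Spark required. This is a standalone illustration of what `MapPartitionsRDD` applies to a single partition's iterator: it streams lazily through that one partition's elements and never touches another partition.

```scala
// One partition's data, as an iterator (what MapPartitionsRDD receives).
val partitionIter: Iterator[Int] = Iterator(1, 2, 3)

// flatMap on the iterator is lazy and stays within this partition;
// nothing here could move data between partitions, hence no shuffle.
val flattened: Iterator[Int] = partitionIter.flatMap(x => Iterator(x, x * 10))

val result: Seq[Int] = flattened.toSeq
```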