What is glom? How is it different from mapPartitions?
Does glom shuffle the data across partitions?
No, it doesn't.
If this is the second case, I believe that the same can be achieved using mapPartitions.
It can:
rdd.mapPartitions(iter => Iterator(iter.toArray))
but the same thing applies to any non-shuffling transformation, such as map, flatMap or filter.
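As a rough check of the two points above, the following sketch runs glom and the mapPartitions form on a small local RDD and compares the results. It assumes Spark is on the classpath and that a local master is acceptable; the object name `GlomVsMapPartitions`, the helper `partitionsAsLists` and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; names and sample data are illustrative only.
object GlomVsMapPartitions {
  // Collect per-partition arrays as plain lists so they can be compared by value
  // (Scala arrays compare by reference with ==).
  def partitionsAsLists(rdd: RDD[Array[Int]]): List[List[Int]] =
    rdd.collect().map(_.toList).toList

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("glom-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10, numSlices = 3)

      // Built-in: one Array per partition.
      val viaGlom = rdd.glom()
      // Hand-written equivalent: wrap the materialized partition in a one-element iterator.
      val viaMapPartitions = rdd.mapPartitions(iter => Iterator(iter.toArray))

      // The number of partitions is unchanged by glom.
      assert(viaGlom.getNumPartitions == rdd.getNumPartitions)
      // Both forms produce the same per-partition contents.
      assert(partitionsAsLists(viaGlom) == partitionsAsLists(viaMapPartitions))

      println(partitionsAsLists(viaGlom))
    } finally {
      sc.stop()
    }
  }
}
```

Both results have the same number of partitions as the input, with each partition holding a single array.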
if there are any use cases which benefit from glom.
Any situation where you need to access partition data in a form that is traversable more than once.
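One possible illustration is centering the values of each partition around that partition's mean, which needs one pass to compute the mean and a second pass to subtract it. The iterator handed to mapPartitions can be consumed only once, whereas the array produced by glom can be traversed repeatedly. This is only a sketch: the object `GlomTwoPass`, the function `centerPerPartition` and the sample data are hypothetical names chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of Spark.
object GlomTwoPass {
  // Subtract each partition's own mean from its elements.
  def centerPerPartition(rdd: RDD[Double]): RDD[Array[Double]] =
    rdd.glom().map { part =>
      if (part.isEmpty) part
      else {
        val mean = part.sum / part.length // first traversal of the array
        part.map(_ - mean)                // second traversal of the same array
      }
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("glom-two-pass")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 10.0, 20.0, 30.0), numSlices = 2)
      // Print one line per partition with its centered values.
      centerPerPartition(rdd).collect().foreach(part => println(part.mkString(", ")))
    } finally {
      sc.stop()
    }
  }
}
```

Note that glom holds the whole partition in memory as one array, so this approach fits best when individual partitions are small enough for the executor.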
glom() transforms each partition into a single array of its elements (an Array in Scala, a list in PySpark). It creates an RDD of arrays, one array per partition.