Does `flatMap` give better performance than `filter` + `map`?
Update: My original answer contained an error: Spark does support `Seq` as the result of a `flatMap` (and converts the result back into a `Dataset`). Apologies for the confusion. I have also added more information on improving the performance of your analysis.
Update 2: I missed that you're using a `Dataset` rather than an `RDD` (doh!). This doesn't affect the answer significantly.
Spark is a distributed system that partitions data across multiple nodes and processes it in parallel. In terms of efficiency, operations that result in re-partitioning (requiring data to be transferred between nodes) are far more expensive at run time than in-place modifications. Also note that operations that merely transform data, such as `filter`, `map`, `flatMap`, etc., are lazily recorded and do not execute until an action (such as `reduce`, `fold`, `aggregate`, etc.) is performed. Consequently, neither alternative actually does anything as things stand.
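For example, here is a minimal sketch (using a hypothetical `Person` case class in the `spark-shell`, where a `SparkSession` named `spark` is already available) showing that the transformations on their own trigger no work:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// Build a small Dataset from local data.
val people = Seq(Person("Alice", 30), Person("Bob", 20)).toDS()

// These transformations are only recorded in the query plan; no data is processed yet.
val adults = people.filter(_.age > 25).map(_.name)

// Execution happens only when an action is invoked.
println(adults.count())   // 1
```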
When an action is performed on the result of these transformations, I would expect the `filter` approach to be far more efficient: it only processes data (using the subsequent `map` operation) that passes the predicate `x => x.age > 25` (more typically written as `_.age > 25`). While it may appear that `filter` creates an intermediate collection, it executes lazily. As a result, Spark effectively fuses the `filter` and `map` operations together.
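As an illustrative sketch (reusing the hypothetical `people` Dataset from above), you can confirm this by inspecting the physical plan:

```scala
// Only rows passing the predicate ever reach the map below.
val names = people
  .filter(_.age > 25)
  .map(_.name)

// explain() prints the physical plan; the filter and map end up in the same
// stage, so no intermediate Dataset is materialized between them.
names.explain()
```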
Your `flatMap` operation is, frankly, hideous. It forces processing, sequence creation and subsequent flattening of every data item, which will definitely increase overall processing time.
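For contrast, the `flatMap` form looks roughly like this (a sketch, not your exact code): every element is turned into a `Seq` that Spark must then flatten back into a `Dataset`:

```scala
// Each element is wrapped in a one-element Seq (or an empty Seq), and Spark
// flattens the results; the extra allocation and flattening buys you nothing here.
val namesViaFlatMap = people.flatMap { p =>
  if (p.age > 25) Seq(p.name) else Seq.empty[String]
}
```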
That said, the best way to improve the performance of your analysis is to control the partitioning so that the data is split roughly evenly over as many nodes as possible. Refer to this guide as a good starting point.
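As a rough sketch of what an explicit hint might look like (the partition count of 8 is an arbitrary assumption; the right number depends on your cluster):

```scala
// Redistribute the data evenly across a chosen number of partitions.
// A common rule of thumb is a few partitions per available core.
val balanced = people.repartition(8)
println(balanced.rdd.getNumPartitions)   // 8
```

Note that `repartition` itself involves a shuffle, so it only pays off when the downstream work is substantial enough to amortize that cost.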