Apache Beam : FlatMap vs Map?
These transforms in Beam are exactly same as Spark (Scala too).
A Map
transform, maps from a PCollection
of N elements into another PCollection
of N elements.
A FlatMap
transform maps a PCollections
of N elements into N collections of zero or more elements, which are then flattened into a single PCollection
.
As a simple example, the following happens:
beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])
# The result is a collection of THREE lists: [[1, 'any'], [2, 'any'], [3, 'any']]
Whereas:
beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])
# The lists that are output by the lambda, are then flattened into a
# collection of SIX single elements: [1, 'any', 2, 'any', 3, 'any']
Let me show you one example
import apache_beam as beam
def categorize_explode(text):
result = text.split(':')
category = result[0]
elements = result[1].split(',')
return list(map(lambda x: (category, x), elements))
with beam.Pipeline() as pipeline:
things = (
pipeline
| 'Categories and Elements' >> beam.Create(["Vehicles:Car,Jeep,Truck,BUS,AIRPLANE","FOOD:Burger,SANDWICH,ICECREAM,APPLE"])
| 'Explode' >> beam.FlatMap(categorize_explode)
| beam.Map(print)
)
As you can see categorize_explode
function splits the strings into categories and corresponding elements and returns iterator like [('Vehicles','Car'),('Vehicles','Jeep'),...]
FlatMap takes each element in this iterator and treats each element as a separate element in PCollection.
So the result would be:
('Vehicles', 'Car')
('Vehicles', 'Jeep')
('Vehicles', 'Truck')
('Vehicles', 'BUS')
('Vehicles', 'AIRPLANE')
('FOOD', 'Burger')
('FOOD', 'SANDWICH')
('FOOD', 'ICECREAM')
('FOOD', 'APPLE')
While Map performs one to one mapping. i.e. this iterator [('Vehicles','Car'),('Vehicles','Jeep'),...]
would be returned as it is.
So the result would be for Map:
[('Vehicles', 'Car'), ('Vehicles', 'Jeep'), ('Vehicles', 'Truck'), ('Vehicles', 'BUS'), ('Vehicles', 'AIRPLANE')]
[('FOOD', 'Burger'), ('FOOD', 'SANDWICH'), ('FOOD', 'ICECREAM'), ('FOOD', 'APPLE')]
The approach I have used is somewhat similar to spark explode transform.
Hope this helps!!!