What runs first: the partitioner or the combiner?

The direct answer to your question is => COMBINER

Details: Combiner can be viewed as mini-reducers in the map phase. They perform a local-reduce on the mapper results before they are distributed further. Once the Combiner functionality is executed, it is then passed on to the Reducer for further work.

where as

The partitioner comes into the picture when we are working one more than on reducer. So, the partitioner decides which reducer is responsible for a particular key. They basically take the Mapper Result(if Combiner is used then Combiner Result) and send it to the responsible Reducer based on the key.

For a better understanding you can refer the following image, which I have taken from Yahoo Developer Tutorial on Hadoop. Figure 4.6: Combiner step inserted into the MapReduce data flow
_{(source: flickr.com)}

Here is the tutorial .

combiner runs before partitiooner

combiner runs after map, to reduce the item count of map output. so it decrease the network overload. reduce runs after partitioner

Partition comes first.

According to "Hadoop, the definitive guide", output of Mapper first writen to memory buffer, then spilled to local dir when buffer is about to overflow. The spilling data is parted according to Partitioner, and in each partition the result is sorted and combined if Combiner given.

You can simply modify the wordcount MR program to verify it. My result is: ("the quick brown fox jumped over a lazy dog")

Word, Step, Time

fox, Mapper, **********754

fox, Partitioner, **********754

fox, Combiner, **********850

fox, Reducer, **********904

Obviously, Combiner runs after Partitioner.

Partitioner runs before Combiner: MapReduce Comprehensive Diagram.

You can have custom partition logic, and after mapper results are partitioned, the partitions are sorted and Combiner is applied to the sorted partitions.

See Hadoop MapReduce Comprehensive Description.

I checked it by running a word-count program with custom Combiner and Partitioner with timestamps logging:

Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682580 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682582 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682583 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682584 : world : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : hello : 1
Apr 23, 2018 2:41:22 PM mapreduce.WordCountPartitioner getPartition
INFO: Partitioner: 1524483682585 : world : 1
18/04/23 14:41:22 INFO mapred.LocalJobRunner: 
18/04/23 14:41:22 INFO mapred.MapTask: Starting flush of map output
18/04/23 14:41:22 INFO mapred.MapTask: Spilling map output
18/04/23 14:41:22 INFO mapred.MapTask: bufstart = 0; bufend = 107; bufvoid = 104857600
18/04/23 14:41:22 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214368(104857472); length = 29/6553600
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682614 : hello 
Apr 23, 2018 2:41:22 PM mapreduce.WordCountCombiner reduce
INFO: Combiner: 1524483682615 : world

What runs first: the partitioner or the combiner?

Tags:

Hadoop

Related

Recent Posts