solving a problem with map reduce
Using irb (ruby-1.9.2-p180):
list = [ {a:2, b:1, d:3}, {a:3, b:2, c:3}, {a:4, b:1, c:3} ]
=> [{:a=>2, :b=>1, :d=>3}, {:a=>3, :b=>2, :c=>3}, {:a=>4, :b=>1, :c=>3}]
Hash[list.map(&:keys).inject(&:&).map{|key| [key,list.map{|arr| arr[key]}.inject(&:+)]}]
=> {:a=>9, :b=>4}
this solution works with multiple arrays (2+) it finds common keys and sums them returning a hash of results
to find common keys (gather keys and find common part):
list.map(&:keys).inject(&:&)
to find sum for key (select values by keys and sum them):
list.map{|arr| arr[key]}.inject(&:+)
to build Hash from array of pairs [[:a,9], [:b,4]]
:
results = [[:a,9], [:b,4]]
Hash[ results ]
I love ruby for this one liners!
You could try by considering the elements given in MapReduce wikipedia article:
- an input reader - in your case this would probably be a method call on
[key, value]
pair from your input hashes. - a Map function - you already have keys you should be processing your data by, so your
map
worker would just return the[key, value]
pair it got as an input - a partition function - a method which would assign a reduce worker based on the key. In your case it could be simply
key.hash % REDUCER_COUNT
. - a compare function - I don't think this is applicable in your case as you don't need values to be processed in any particular order.
- a Reduce function - would be given
[key, list]
pair, list being list of values associated with the key. It would return the sum oflist
if list is more than one element long (as you want only elements appearing in both input hashes processed). - an output writer - could be plain Hash in your example.
And here's my (over)simplified implementation of the above.
Assuming that we have all the other map-reduce-related functions implemented (input reader, output writer, global sort, ...), these would be the map
and reduce
ones:
def map(input)
input.each do |count, letter|
yield [letter, count]
end
end
def reduce(letter, partial_counts)
result = if partial_counts.size == 2
partial_counts[0] + partial_counts[1]
end
yield result
end
The map
function will yield
a pair (letter, count)
, which will be grouped later. Then for each letter
received from map
s reduce
will receive an array containing every count yielded by a map
for that letter
. As you wish to only yield if the letter occurs on both hashes, we need count
s to appear on partial_counts
twice to use it to compute the sum at the end. The reduce
function could be implemented in several ways. I've tried to make it as simples as possible to understand, although it's implementation is very adjusted to this problem.
Using these map
and reduce
implementation will return the last hash with keys and value inverted, which makes more sense, as there may be several letter with the same count. The input would be better if it inverted keys and values too. This way, map
would be as simple as yielding each pair of (letter, count)
:
def map(input)
input.each do |letter, count|
yield [letter, count]
end
end
or
def map(input)
input.each do |i|
yield i
end
end