How to create a histogram from a flat Array in Ruby

Use the histogram gem.

require 'histogram/array'  # adds Array#histogram (gem install histogram)

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
(bins, freqs) = data.histogram

This returns two arrays: bins, containing the bins of the histogram, and freqs, containing the frequency for each bin. The gem also supports different binning behaviors and weights/fractions.
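To make the meaning of bins and freqs concrete, here is a minimal pure-Ruby sketch of equal-width binning. It is not the gem's algorithm: simple_histogram is a hypothetical helper, the bin count is passed in by hand, and each bin is reported by its lower edge, all of which are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of the histogram gem: equal-width binning.
def simple_histogram(data, bin_count)
  min, max = data.minmax
  width = (max - min).to_f / bin_count
  width = 1.0 if width.zero? # all values equal: everything falls in bin 0

  freqs = Array.new(bin_count, 0)
  data.each do |x|
    # Clamp so the maximum value lands in the last bin, not one past it.
    index = [((x - min) / width).floor, bin_count - 1].min
    freqs[index] += 1
  end

  # Report each bin by its lower edge.
  bins = Array.new(bin_count) { |i| min + i * width }
  [bins, freqs]
end

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
bins, freqs = simple_histogram(data, 5)
# bins  => [0.0, 2.0, 4.0, 6.0, 8.0]
# freqs => [2, 11, 6, 8, 4]
```

The gem picks the number of bins for you and offers more options, so prefer it when you need anything beyond this.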

Hope this helps.


Ruby's Array inherits group_by from Enumerable, which does this nicely:

Hash[*data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }]

Which returns:

{
     0 => 1,
     1 => 1,
     2 => 5,
     3 => 6,
     4 => 4,
     5 => 2,
     6 => 3,
     7 => 5,
     8 => 1,
     9 => 2,
    10 => 1
}

That's a nice, clean hash. If you want an array of [bin, frequency] pairs instead, you can shorten it and use map:

data = [0,1,2,2,3,3,3,4]
data.group_by{ |v| v }.map{ |k, v| [k, v.size] }
# => [[0, 1], [1, 1], [2, 2], [3, 3], [4, 1]]

Here's what group_by and flat_map are doing, step by step, with the smaller dataset:

data.group_by{ |v| v }    
# => {0=>[0], 1=>[1], 2=>[2, 2], 3=>[3, 3, 3], 4=>[4]}

data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }  
# => [0, 1, 1, 1, 2, 2, 3, 3, 4, 1]
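As the first step above shows, group_by builds an array of the values for every key just so its size can be taken. If that intermediate hash bothers you, a single-pass count is a common alternative; this is a sketch of that idiom, not something the approaches above depend on:

```ruby
data = [0,1,2,2,3,3,3,4]

# Hash.new(0) makes missing keys default to 0, so += 1 works the first
# time a value is seen.
counts = data.each_with_object(Hash.new(0)) { |v, h| h[v] += 1 }
# => {0=>1, 1=>1, 2=>2, 3=>3, 4=>1}
```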

As mentioned by Telmo Costa in the comments, Ruby introduced tally in v2.7.0. A quick benchmark shows that tally is about 2x faster than the group_by versions:

require 'fruity'

puts "Ruby v#{RUBY_VERSION}"

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]

data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by { |v| v }.transform_values(&:size)
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.tally 
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}

compare do
  gb { data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h }
  rriemann { data.group_by { |v| v }.transform_values(&:size) }
  telmo_costa { data.tally }
  CBK { data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h }
end

Resulting in:

# >> Ruby v2.7.0
# >> Running each test 1024 times. Test will take about 2 seconds.
# >> telmo_costa is faster than rriemann by 2x ± 0.1
# >> rriemann is similar to gb
# >> gb is faster than CBK by 8x ± 1.0

So use tally.
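For completeness, here is a small sketch of going from tally to a printable histogram. The sort step and the asterisk rendering are my own additions for illustration; tally itself returns keys in order of first appearance, which matters when the input is not already sorted.

```ruby
data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]

counts = data.tally        # requires Ruby 2.7 or later
sorted = counts.sort.to_h  # order bins by value, not by first appearance

# Render each bin as a row of asterisks, one per occurrence.
lines = sorted.map { |value, count| format('%2d | %s', value, '*' * count) }
puts lines
```

The first line printed is " 0 | *" and the row for 3 carries six asterisks, matching the hash shown earlier.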
