How to create a histogram from a flat Array in Ruby
Use the "histogram" gem:
require 'histogram/array'  # adds Array#histogram
data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
(bins, freqs) = data.histogram
This creates an array bins containing the bins of the histogram and an array freqs containing the frequencies.
The gem also supports different binning behaviors and weights/fractions.
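To give a rough idea of what binning means here, the following is a minimal plain-Ruby sketch of equal-width binning. It is not the gem's implementation or API: binned_counts is a hypothetical helper written for illustration, and it assumes numeric data with at least two distinct values.

```ruby
# Hypothetical helper, NOT part of the histogram gem.
# Splits the range [min, max] into bin_count equal-width bins and
# returns the lower edge of each bin plus the count of values in it.
# Assumes numeric data with at least two distinct values.
def binned_counts(data, bin_count)
  min, max = data.minmax
  width = (max - min).to_f / bin_count
  counts = Array.new(bin_count, 0)

  data.each do |value|
    # The maximum value would compute to index bin_count,
    # so clamp it into the last bin.
    index = [((value - min) / width).floor, bin_count - 1].min
    counts[index] += 1
  end

  edges = Array.new(bin_count) { |i| min + i * width }
  [edges, counts]
end

data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
edges, counts = binned_counts(data, 5)
# edges  => [0.0, 2.0, 4.0, 6.0, 8.0]
# counts => [2, 11, 6, 8, 4]
```

The gem's own bins may be placed differently (for example, centered rather than left-aligned), so treat the numbers above as belonging to this sketch only.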
Hope this helps.
Ruby's Array inherits group_by from Enumerable, which does this nicely:
Hash[*data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }]
Which returns:
{
0 => 1,
1 => 1,
2 => 5,
3 => 6,
4 => 4,
5 => 2,
6 => 3,
7 => 5,
8 => 1,
9 => 2,
10 => 1
}
That's just a nice 'n clean hash. If you want an array of each bin and frequency pair you can shorten it and use:
data = [0,1,2,2,3,3,3,4]
data.group_by{ |v| v }.map{ |k, v| [k, v.size] }
# => [[0, 1], [1, 1], [2, 2], [3, 3], [4, 1]]
Here's what group_by and flat_map are doing with the smaller dataset:
data.group_by{ |v| v }
# => {0=>[0], 1=>[1], 2=>[2, 2], 3=>[3, 3, 3], 4=>[4]}
data.group_by{ |v| v }.flat_map{ |k, v| [k, v.size] }
# => [0, 1, 1, 1, 2, 2, 3, 3, 4, 1]
As mentioned by Telmo Costa in the comments, Ruby introduced tally in v2.7.0. Running a quick benchmark shows that tally is about 2x faster than the group_by variants:
require 'fruity'
puts "Ruby v#{RUBY_VERSION}"
data = [0,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,6,6,6,7,7,7,7,7,8,9,9,10]
data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by { |v| v }.transform_values(&:size)
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.tally
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h
# => {0=>1, 1=>1, 2=>5, 3=>6, 4=>4, 5=>2, 6=>3, 7=>5, 8=>1, 9=>2, 10=>1}
compare do
gb { data.group_by{ |v| v }.map{ |k, v| [k, v.size] }.to_h }
rriemann { data.group_by { |v| v }.transform_values(&:size) }
telmo_costa { data.tally }
CBK {data.group_by{ |v| v }.keys.sort.map { |key| [key, data.group_by{ |v| v }[key].size] }.to_h }
end
Resulting in:
# >> Ruby v2.7.0
# >> Running each test 1024 times. Test will take about 2 seconds.
# >> telmo_costa is faster than rriemann by 2x ± 0.1
# >> rriemann is similar to gb
# >> gb is faster than CBK by 8x ± 1.0
So use tally.
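If you are stuck on a Ruby older than 2.7, where tally is not available, a common idiom gives the same result in a single pass. This is a small sketch using the smaller dataset from above; it was not part of the benchmark, so no speed claim is made for it.

```ruby
data = [0,1,2,2,3,3,3,4]

# Hash.new(0) makes every missing key start at a count of 0,
# so each element can simply be incremented as it is seen.
counts = data.each_with_object(Hash.new(0)) { |v, h| h[v] += 1 }
# => {0=>1, 1=>1, 2=>2, 3=>3, 4=>1}
```

Unlike the hash that tally returns, this one keeps its default value of 0, so looking up a value that never occurred returns 0 instead of nil.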