Word count in Rails?
The answers here have a couple of issues:
- They don't account for non-ASCII (Unicode) letters such as those with diacritics: áâãêü, etc.
- They don't account for apostrophes and hyphens. So `Joe's` will be counted as two words, `Joe` and `'s`, which is obviously incorrect. So will `twenty-two`, which is a single compound word.
Something like this works better and accounts for those issues:
foo.scan(/[\p{Alpha}\-']+/)
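To make the difference concrete, here is a minimal sketch that runs both patterns over the same sentence. The `WORD_PATTERN` constant and the `count_words` helper are names made up for this illustration, not part of Rails or any gem.

```ruby
# Hypothetical names for illustration only.
WORD_PATTERN = /[\p{Alpha}\-']+/

def count_words(text)
  text.scan(WORD_PATTERN).size
end

# \u00e9 is "é", written as an escape so the example does not depend on file encoding.
text = "Joe's caf\u00e9 has twenty-two tables"

# \w is ASCII-only in Ruby, so it stops at the apostrophe, the "é" and the hyphen.
ascii_tokens = text.scan(/\w+/)

# The Unicode-aware pattern keeps "Joe's", "café" and "twenty-two" whole.
unicode_tokens = text.scan(WORD_PATTERN)

puts ascii_tokens.size    # 7
puts unicode_tokens.size  # 5
puts count_words(text)    # 5
```

One caveat with this pattern: a hyphen or apostrophe standing on its own (as in `"a - b"`) is also matched as a word, so you may want to filter those out if your text contains them.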
You might want to look at my Words Counted gem. It lets you count words, their occurrences, lengths, and a couple of other things. It's also very well documented.
counter = WordsCounted::Counter.new(post.body)
counter.word_count #=> 3
counter.most_occuring_words #=> [["lorem", 3]]
# This also takes capitalisation into account.
# So `Hello` and `hello` are counted as the same word.
Also:
"Lorem Lorem Lorem".split.size
=> 3
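Note that `split` with no arguments only splits on runs of whitespace, so any stand-alone punctuation is counted as a word. A small sketch of how that differs from a `scan`-based count (the helper names `split_count` and `scan_count` are made up for this example):

```ruby
# Hypothetical helpers for illustration only.
def split_count(text)
  text.split.size        # splits on runs of whitespace
end

def scan_count(text)
  text.scan(/\w+/).size  # counts runs of ASCII word characters
end

sample = "Lorem  ipsum -- dolor"

puts split_count(sample)  # 4, because "--" is counted as a word
puts scan_count(sample)   # 3
```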
If you're interested in performance, I wrote a quick benchmark:
require 'benchmark'
require 'bigdecimal/math'
require 'active_support/core_ext/string/filters'
# Where "shakespeare" is the full text of The Complete Works of William Shakespeare...
puts 'Benchmarking shakespeare.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.squish.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.split.size } }
puts 'Benchmarking shakespeare.squish.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.split.size } }
The results:
Benchmarking shakespeare.scan(/\w+/).size x50
13.980000 0.240000 14.220000 ( 14.234612)
Benchmarking shakespeare.squish.scan(/\w+/).size x50
40.850000 0.270000 41.120000 ( 41.109643)
Benchmarking shakespeare.split.size x50
5.820000 0.210000 6.030000 ( 6.028998)
Benchmarking shakespeare.squish.split.size x50
31.000000 0.260000 31.260000 ( 31.268706)
In other words, `squish` is slow with Very Large Strings™. Other than that, `split` is faster (twice as fast if you're not using `squish`).
"Lorem Lorem Lorem".scan(/\w+/).size
=> 3
UPDATE: if you need to match rock-and-roll as one word, you could do something like:
"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4