Word count in Rails?

The answers here have a couple of issues:

  1. They don't account for non-ASCII letters, such as those with diacritics: áâãêü, etc.
  2. They don't account for apostrophes and hyphens. So Joe's is counted as two words, Joe and s, which is incorrect. The same goes for twenty-two, which is a single compound word.

Something like this works better and accounts for those issues:

foo.scan(/[\p{Alpha}\-']+/)
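
To make the difference concrete, here is a small sketch comparing `\w+` with the pattern above. The sample sentence is made up for illustration, and the é is written as `\u00e9` so the example does not depend on how the file is saved.

```ruby
# Illustrative sample text (not from the original answer).
text = "Joe's caf\u00e9 sold twenty-two scones"

# \w only matches ASCII letters, digits and underscore,
# so words break at the apostrophe, the hyphen and the é.
ascii_words = text.scan(/\w+/)
# ascii_words => ["Joe", "s", "caf", "sold", "twenty", "two", "scones"]

# \p{Alpha} matches Unicode letters; apostrophes and hyphens are allowed too.
unicode_words = text.scan(/[\p{Alpha}\-']+/)
# unicode_words => ["Joe's", "café", "sold", "twenty-two", "scones"]

puts ascii_words.size   # 7
puts unicode_words.size # 5
```

Two things to keep in mind with this pattern: `\p{Alpha}` does not match digits, so a token like 22 is not counted, and a hyphen or apostrophe standing on its own (as in "a - b") is counted as a word.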

You might want to look at my Words Counted gem. It lets you count words, their occurrences, lengths, and a couple of other things. It's also very well documented.

counter = WordsCounted::Counter.new(post.body)
counter.word_count #=> 3
counter.most_occuring_words #=> [["lorem", 3]]
# This also takes capitalisation into account.
# So `Hello` and `hello` are counted as the same word.
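
If you'd rather not pull in a gem, the same idea (case-insensitive occurrence counts) can be sketched in plain Ruby. This is not how the gem is implemented; `word_frequencies` is a hypothetical helper name used only for this example.

```ruby
# Hypothetical helper, not part of the Words Counted gem.
# Returns a hash mapping each lowercased word to how often it occurs.
def word_frequencies(text)
  counts = Hash.new(0)
  text.scan(/[\p{Alpha}\-']+/).each do |word|
    counts[word.downcase] += 1
  end
  counts
end

freq = word_frequencies("Hello hello world")
# freq => {"hello"=>2, "world"=>1}

# The most frequent word and its count:
top = freq.max_by { |_word, count| count }
# top => ["hello", 2]
```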

Also:

"Lorem Lorem Lorem".split.size
=> 3

If you're interested in performance, I wrote a quick benchmark:

require 'benchmark'
require 'bigdecimal/math'
require 'active_support/core_ext/string/filters'

# Where "shakespeare" is the full text of The Complete Works of William Shakespeare...

puts 'Benchmarking shakespeare.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.squish.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.split.size } }
puts 'Benchmarking shakespeare.squish.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.split.size } }

The results:

Benchmarking shakespeare.scan(/\w+/).size x50
 13.980000   0.240000  14.220000 ( 14.234612)
Benchmarking shakespeare.squish.scan(/\w+/).size x50
 40.850000   0.270000  41.120000 ( 41.109643)
Benchmarking shakespeare.split.size x50
  5.820000   0.210000   6.030000 (  6.028998)
Benchmarking shakespeare.squish.split.size x50
 31.000000   0.260000  31.260000 ( 31.268706)

In other words, squish is slow with Very Large Strings™. Other than that, split is faster than scan (more than twice as fast if you're not using squish).
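
For plain word counting you can usually skip squish altogether: `split` with no arguments already ignores leading and trailing whitespace and treats runs of spaces, tabs and newlines as a single separator. A quick sketch with a made-up string:

```ruby
# Illustrative input with leading/trailing whitespace, tabs and blank lines.
messy = "  Lorem \t ipsum\n\ndolor   sit  "

words = messy.split
# words => ["Lorem", "ipsum", "dolor", "sit"]

puts words.size # 4
```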


"Lorem Lorem Lorem".scan(/\w+/).size
=> 3

UPDATE: if you need to match rock-and-roll as one word, you could do something like:

"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4