Ruby Text Analysis

The Mendicant Bug: NLP Resources for Ruby contains lots of useful Ruby NLP links.
I had tried using the Ruby Linguistics stuff a long time ago, and remember having a lot of problems with it... I don't recommend jumping into that.

If most of your text analysis involves stuff like counting ngrams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt stuff to the idiosyncrasies of the problem you are trying to solve.

Like the Stanford parser gem, its possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so probably not the best way to solve a problem.


the generalization of word frequencies are Language Models, e.g. uni-grams (= single word frequency), bi-grams (= frequency of word pairs), tri-grams (=frequency of world triples), ..., in general: n-grams

You should look for an existing toolkit for Language Models — not a good idea to re-invent the wheel here.

There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed!! because you have to process huge corpora) and generate standard output format ARPA n-gram files (those are typically a text format)

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you generated your Language Model with one of these toolkits, you will need either a Ruby Gem which makes the language model accessible in Ruby, or you need to convert the ARPA format into your own format.

adi92's post lists some more Ruby NLP resources.

You can also Google for "ARPA Language Model" for more info

Last not least check Google's online N-gram tool. They built n-grams based on the books they digitized — also available in French and other languages!