Orange vs NLTK for Content Classification in Python

NLTK is a toolkit that supports a four-stage model of natural language processing (a short code sketch follows the list):

  1. Tokenizing: grouping characters into words. This ranges from trivial regex work to handling contractions like "can't".
  2. Tagging. This is applying part-of-speech tags to the tokens (e.g. "NN" for noun, "VBG" for verb gerund). This is typically done by training a model (e.g. a Hidden Markov model) on a training corpus (i.e. a large list of hand-tagged sentences).
  3. Chunking/Parsing. This is taking each tagged sentence and extracting features into a tree (e.g. noun phrases). This can be done according to a hand-written grammar or one trained on a corpus.
  4. Information extraction. This is traversing the tree and extracting the data. This is where your specific orange = fruit classification would be done.
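
Here is a minimal sketch of those stages with NLTK; the sentence and the noun-phrase grammar are purely illustrative, and the tokenizer/tagger models are assumed to have been downloaded via nltk.download():

    import nltk

    sentence = "I ate an orange at the market."

    # 1. Tokenizing: split the raw string into word tokens
    tokens = nltk.word_tokenize(sentence)

    # 2. Tagging: assign a part-of-speech tag to each token,
    #    e.g. ('orange', 'NN'), ('ate', 'VBD')
    tagged = nltk.pos_tag(tokens)

    # 3. Chunking with a hand-written grammar: a simple noun phrase is an
    #    optional determiner, any number of adjectives, then a noun
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    tree = chunker.parse(tagged)

    # 4. Information extraction: walk the tree and pull out the NP chunks
    for np in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        print(np)   # e.g. (NP an/DT orange/NN)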

NLTK supports WordNet, a huge semantic dictionary that classifies words. For example, there are 5 noun senses for 'orange' (fruit, tree, pigment, color, river in South Africa). Each of these has one or more 'hypernym paths' that are hierarchies of classifications. E.g. the first sense of 'orange' has two paths:

  • orange/citrus/edible_fruit/fruit/reproductive_structure/plant_organ/plant_part/natural_object/whole/object/physical_entity/entity

and

  • orange/citrus/edible_fruit/produce/food/solid/matter/physical_entity/entity

Depending on your application domain you can identify orange as a fruit, or a food, or a plant part. Then you can use the chunked tree structure to determine more (who did what to the fruit, etc.).
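
As a minimal sketch, here is how those senses and paths can be listed with NLTK's WordNet interface (this assumes the WordNet corpus has been downloaded via nltk.download()):

    from nltk.corpus import wordnet as wn

    # The noun senses of 'orange', with their dictionary definitions
    for synset in wn.synsets('orange', pos=wn.NOUN):
        print(synset.name(), '-', synset.definition())

    # Every hypernym path for the first (fruit) sense, printed from
    # 'orange' up to the 'entity' root, as in the bullets above
    fruit = wn.synsets('orange', pos=wn.NOUN)[0]
    for path in fruit.hypernym_paths():
        print(' / '.join(s.name() for s in reversed(path)))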


Well, as evidenced by the documentation, the Naive Bayes implementation in each library is easy to use, so why not run your data through both and compare the results?

Orange and NLTK are both mature, stable libraries (10+ years of development each) that originated in large universities; they share some common features, primarily machine-learning algorithms. Beyond that, they are quite different in scope, purpose, and implementation.

Orange is domain agnostic: it is not directed towards a particular academic discipline or commercial domain; instead, it advertises itself as a full-stack data-mining and ML platform. Its focus is on the tools themselves, not on the application of those tools in a particular discipline.

Its features include data IO, data-analysis algorithms, and a data-visualization canvas.

NLTK, on the other hand, began as and remains an academic project in the computational linguistics department of a large university. The task you mentioned (document content classification) and your algorithm of choice (Naive Bayes) are pretty much right at the core of NLTK's functionality. NLTK does include ML/data-mining algorithms, but only because they have particular utility in computational linguistics; they sit alongside the document parsers, tokenizers, part-of-speech analyzers, etc. that make up the rest of the toolkit.

Perhaps the Naive Bayes implementation in Orange is just as good, but I would still choose NLTK's implementation because it is clearly optimized for the particular task you mentioned.
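
For illustration, here is a small sketch of how content classification looks with NLTK's NaiveBayesClassifier; the documents and labels are made up, and a real application would use a proper training corpus and richer features:

    import nltk

    def features(text):
        # Simple bag-of-words features: presence of each lower-cased token
        return {token.lower(): True for token in nltk.word_tokenize(text)}

    # Hypothetical hand-labeled training documents
    train_set = [
        (features("the orange was sweet and juicy"), "fruit"),
        (features("peel the orange before eating it"), "fruit"),
        (features("the walls were painted a bright orange"), "color"),
        (features("an orange glow filled the evening sky"), "color"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(features("a ripe orange from the tree")))
    classifier.show_most_informative_features(5)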

There are numerous tutorials on NLTK, and in particular on using its Naive Bayes classifier for content classification. A blog post by Jim Plus and another on streamhacker.com, for instance, present excellent tutorials on NLTK's Naive Bayes; the second includes a line-by-line discussion of the code required to use this module. The authors of both posts report good results with NLTK (92% accuracy in the former, 73% in the latter).


I don't know Orange, but +1 for NLTK:

I've successfully used the classification tools in NLTK to classify text and related metadata. Naive Bayes is the default, but there are alternatives such as Maximum Entropy. Also, being a toolkit, you can customize as you see fit, e.g. by creating your own features (which is what I did for the metadata); a rough sketch follows.
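
To give a rough idea of what a custom feature extractor mixing text and metadata might look like (the metadata fields here are hypothetical, not the ones I actually used):

    import nltk

    def doc_features(text, author=None, source=None):
        # Bag-of-words features from the text itself
        feats = {token.lower(): True for token in nltk.word_tokenize(text)}
        # Fold metadata in as extra named features
        if author:
            feats['author=' + author] = True
        if source:
            feats['source=' + source] = True
        return feats

    # The resulting feature sets train either classifier interchangeably:
    #   nltk.NaiveBayesClassifier.train(train_set)
    #   nltk.MaxentClassifier.train(train_set)   # Maximum Entropy alternative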

NLTK also has a couple of good books, one of which is available free under a Creative Commons license (as well as in print from O'Reilly).