Implementing a bag-of-words Naive Bayes classifier in NLTK
- put the string you are looking at into a list, broken into words
- for each item in the list, ask: is this item a feature in my feature list?
- if it is, add its log prob as normal; if not, ignore it.
If your sentence contains the same word multiple times, the log prob is simply added once per occurrence. If a word appears multiple times in the same class, your training data should reflect that in the word count.
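As a concrete illustration of those steps, here is a minimal sketch; the toy log-probability tables are made up for the example, and in practice they would come from your training counts:

import math

# Made-up log priors and per-class log P(word | class), for illustration only
log_prior = {'pos': math.log(0.5), 'neg': math.log(0.5)}
log_prob = {
    'pos': {'great': math.log(0.05), 'boring': math.log(0.005)},
    'neg': {'great': math.log(0.005), 'boring': math.log(0.05)},
}

def classify(sentence):
    words = sentence.lower().split()         # break the string into a list of words
    scores = {}
    for label in log_prior:
        total = log_prior[label]
        for w in words:
            if w in log_prob[label]:         # is this word one of my features?
                total += log_prob[label][w]  # yes: add its log prob (again for each repeat)
            # no: unknown words are simply ignored
        scores[label] = total
    return max(scores, key=scores.get)

print(classify("a great great movie"))       # 'great' is counted twice -> 'pos'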
For added accuracy, count all bi-grams, tri-grams, etc. as separate features.
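For instance, NLTK's ngrams helper can generate those extra features; joining each n-gram into a single string key is just one possible encoding, and the function name here is made up:

from nltk.util import ngrams

def extract_features(words, max_n=3):
    # Treat every unigram, bigram and trigram as its own feature key
    features = []
    for n in range(1, max_n + 1):
        for gram in ngrams(words, n):
            features.append(' '.join(gram))
    return features

print(extract_features(['not', 'very', 'good']))
# ['not', 'very', 'good', 'not very', 'very good', 'not very good']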
It helps to write your own classifiers by hand so that you understand exactly what is happening and what you need to do to improve accuracy. If you use a pre-packaged solution and it doesn't work well enough, there is not much you can do about it.
scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.
As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes that into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)
import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF weighting -> keep the 1000 best features by chi2 -> multinomial naive Bayes
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# Each document becomes a bag of words: a FreqDist mapping word -> count
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]

# Train on the first 100 documents of each class, test on the remaining 900
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
This printed for me:
Confusion matrix:
524 376
202 698
Not perfect, but decent, considering it's not a super easy problem and it was only trained on 100 examples per class.
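If you want to try the SVM suggestion from earlier, one sketch (untested, and reusing Pipeline, TfidfTransformer, SelectKBest, chi2, SklearnClassifier, add_label, pos, and neg from the snippet above) is to swap a linear SVM into the same pipeline:

from sklearn.svm import LinearSVC

# Same TF-IDF weighting and chi2 feature selection, but with a linear SVM on top
svm_pipeline = Pipeline([('tfidf', TfidfTransformer()),
                         ('chi2', SelectKBest(chi2, k=1000)),
                         ('svc', LinearSVC())])
svm_classif = SklearnClassifier(svm_pipeline)
svm_classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
svm_l_pos = np.array(svm_classif.classify_many(pos[100:]))
svm_l_neg = np.array(svm_classif.classify_many(neg[100:]))

Note that LinearSVC has no predict_proba, so prob_classify_many won't work with this pipeline, but classify_many is fine.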
The features in the NLTK Bayes classifier are "nominal", not numeric. This means they can take a finite number of discrete values (labels), but they can't be treated as frequencies.
So with the Bayes classifier, you cannot directly use word frequency as a feature. You could do something like use the 50 most frequent words from each text as your feature set, but that's quite a different thing.
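For example, here is a rough sketch along the lines of the NLTK book's document-classification example, where each feature is a boolean "contains(word)" rather than a count (the 2000-word vocabulary and the 1900/100 split are arbitrary choices here):

import random
import nltk
from nltk.corpus import movie_reviews

# Use the most frequent words in the whole corpus as the feature vocabulary
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(w.lower() for w in document)
    # Each feature is nominal: True or False, never a frequency
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
featuresets = [(document_features(doc), label) for (doc, label) in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets[:1900])
print(nltk.classify.accuracy(classifier, featuresets[1900:]))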
But maybe there are other classifiers in the NLTK that depend on frequency. I wouldn't know, but have you looked? I'd say it's worth checking out.