NLTK and language detection
Super late, but you could use the textcat classifier in nltk. This paper discusses the algorithm.
It returns a language code in ISO 639-3 (a language code, not a country code), so I would use pycountry to get the full name.
For example, load the libraries:
import nltk
import pycountry
from nltk.stem import SnowballStemmer
Now let's look at two phrases and guess their language:
phrase_one = "good morning"
phrase_two = "goeie more"
tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)
guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
You could then pass them into other nltk functions, for example:
stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))  # stem the word with the stemmer for the guessed language
Disclaimer: obviously this will not always work, especially for sparse data.
Extreme example:
guess_example = tc.guess_language("hello")
Konkani (individual language)
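For intuition on why a one-word input like "hello" can land on an unexpected language: textcat-style classifiers compare ranked character n-gram profiles, and a very short string gives only a handful of n-grams to compare. Below is a toy sketch of that idea, not NLTK's actual implementation. The two training sentences, the function names, and the profile size are all made up for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=50):
    """Map the most frequent character n-grams (1..max_n) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}

def out_of_place(sample, reference):
    """Sum of rank differences; n-grams missing from `reference` get a fixed penalty."""
    penalty = len(reference)
    return sum(
        abs(rank - reference[gram]) if gram in reference else penalty
        for gram, rank in sample.items()
    )

# Tiny, hypothetical training texts; real profiles are built from large corpora.
TRAINING = {
    "eng": "good morning to you all, the weather is very good this morning",
    "afr": "goeie more aan julle almal, die weer is baie goed vanoggend",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

def guess(text):
    """Return the key of the toy profile closest to `text`."""
    sample = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(sample, PROFILES[lang]))
```

With so little training text the guesses are not reliable. The point is only that the distance is computed from n-gram ranks, so a short input leaves very little to rank.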
This library is not from NLTK either, but it certainly helps.
$ sudo pip install langdetect
Supported Python versions: 2.6, 2.7, 3.x.
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
>>> detect("Ein, zwei, drei, vier")
P.S.: Don't expect this to always work correctly:
>>> detect("today is a good day")
>>> detect("today is a good day.")
>>> detect("la vita e bella!")
>>> detect("khoobi? khoshi?")
>>> detect("wow")
>>> detect("what a day")
>>> detect("yay!")
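One way to soften the short-input problem is to refuse to guess when there is too little text or the top candidate is weak. The sketch below uses langdetect's detect_langs and DetectorFactory.seed. The helper name and both thresholds (min_chars, min_prob) are arbitrary assumptions of mine, not recommendations from the library.

```python
def detect_confident(text, min_chars=20, min_prob=0.9):
    """Return a language code, or None if the input is too short or the guess is weak."""
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        return None  # too little evidence; skip detection entirely

    # Imported lazily so the length check works even without langdetect installed.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is randomized; a fixed seed makes repeated runs agree.
    DetectorFactory.seed = 0
    try:
        candidates = detect_langs(cleaned)  # sorted by probability, best first
    except LangDetectException:
        return None  # e.g. text with no usable characters
    if not candidates:
        return None
    best = candidates[0]
    return best.lang if best.prob >= min_prob else None
```

With these thresholds, inputs such as "wow" or "yay!" return None instead of a guess. Longer inputs still go through the detector and can still be wrong.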
Have you come across the following code snippet?
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
Or the following demo file?
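The vocabulary-difference snippet above can be turned into a rough "how English is this?" score by taking the share of distinct words that are missing from the vocabulary. This is only a sketch: unusual_ratio is a hypothetical helper, and the small english_vocab set stands in for the set built from nltk.corpus.words.words().

```python
def unusual_ratio(tokens, vocab):
    """Fraction of distinct alphabetic tokens missing from `vocab` (None if there are none)."""
    text_vocab = {t.lower() for t in tokens if t.isalpha()}
    if not text_vocab:
        return None  # nothing to judge, e.g. only numbers or punctuation
    return len(text_vocab - vocab) / len(text_vocab)

# Hypothetical stand-in for set(w.lower() for w in nltk.corpus.words.words()).
english_vocab = {"today", "is", "a", "good", "day", "morning"}

print(unusual_ratio("today is a good day".split(), english_vocab))  # 0.0
print(unusual_ratio("goeie more".split(), english_vocab))           # 1.0
```

A low ratio suggests English and a high one suggests something else. It does not say which other language, and names or slang will inflate the ratio.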
Although this is not in NLTK, I have had great results with another Python-based library:
This is very simple to import and includes a large number of languages in its model.