Detecting whether or not text is English (in bulk)

I have read about a method for detecting English using trigrams.

You can go over the text and count the most frequently used trigrams (three-letter sequences) in its words. If the most frequent ones match the most frequent trigrams in English words, the text is probably written in English.
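As a rough illustration of the idea, here is a minimal sketch in Python. The list of common English trigrams is a small hand-picked set used only for illustration, and the names `top_trigrams`, `english_score`, `looks_english` and the 0.3 threshold are hypothetical choices, not taken from any library or from the project linked below.

```python
import re
from collections import Counter

# Illustrative, hand-picked set of frequent English trigrams (an assumption;
# a real detector would derive its profile from a large English corpus).
COMMON_ENGLISH_TRIGRAMS = {
    "the", "and", "ing", "ion", "tio", "ent", "ati", "for", "her", "ter",
    "hat", "tha", "ere", "ate", "his", "con", "res", "ver", "all", "ons",
}


def top_trigrams(text, n=20):
    """Return up to n of the most frequent letter trigrams found inside words."""
    counts = Counter()
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return [trigram for trigram, _ in counts.most_common(n)]


def english_score(text, n=20):
    """Fraction of the text's top trigrams that are in the English set."""
    top = top_trigrams(text, n)
    if not top:
        return 0.0
    hits = sum(1 for trigram in top if trigram in COMMON_ENGLISH_TRIGRAMS)
    return hits / len(top)


def looks_english(text, threshold=0.3):
    """Guess whether text is English; the threshold is an arbitrary assumption."""
    return english_score(text) >= threshold


if __name__ == "__main__":
    for sample in ("the cat and the dog are eating",
                   "el gato y el perro comen mucho"):
        print(sample, "->", looks_english(sample))
```

Very short texts give unreliable scores, so for bulk processing you would probably want to build the trigram profile from a real corpus and tune the threshold on your own data.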

Take a look at this Ruby project:

https://github.com/feedbackmine/language_detector

EDIT: This won't work in this case, since the OP is processing text in bulk, which is against Google's TOS.

Use the Google Translate language detect API. Python example adapted from the docs:

import json
import urllib.request

url = ('https://ajax.googleapis.com/ajax/services/language/detect?'
       'v=1.0&q=Hola,%20mi%20amigo!&key=INSERT-YOUR-KEY&userip=INSERT-USER-IP')
# Replace the Referer value with the URL of your own site.
request = urllib.request.Request(url, None, {'Referer': 'INSERT-YOUR-SITE-URL'})
response = urllib.request.urlopen(request)
results = json.load(response)
if results['responseData']['language'] == 'en':
    print('English detected')
