Python: How to determine the language?

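TextBlob. Requires the NLTK package and calls the Google Translate API, so it needs network access. Note that detect_language() has been deprecated in recent TextBlob releases.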

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()

pip install textblob

Polyglot. Requires numpy and some arcane libraries. (On Windows, get appropriate versions of the PyICU, Morfessor and PyCLD2 wheels, then just pip install downloaded_wheel.whl.) Able to detect texts with mixed languages.

from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)

# name: English     code: en       confidence:  87.0 read bytes:  1154
# name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
# name: un          code: un       confidence:   0.0 read bytes:     0

pip install polyglot

To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

chardet can also report the language when the input contains bytes in the range 128-255:

>>> import chardet
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}

pip install chardet
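
The byte-range condition above can be checked without chardet. This is a minimal sketch using only the standard library; `has_high_bytes` is a hypothetical helper name, not part of chardet, and it only tests the precondition rather than detecting anything itself:

```python
def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte is above 127, i.e. outside plain ASCII."""
    return any(byte > 127 for byte in data)


samples = {
    "cp1251": "Я люблю вкусные пампушки".encode("cp1251"),
    "ascii": "I love tasty buns".encode("ascii"),
}
for name, raw in samples.items():
    # Only the cp1251 sample carries the high bytes a language guess relies on.
    print(name, has_high_bytes(raw))
```

Pure-ASCII input gives chardet nothing to base a language guess on, so a check like this can be used to decide whether to try another detector first.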

langdetect. Requires large portions of text. It uses a non-deterministic approach under the hood, so you can get different results for the same text sample. The docs say to set a seed to make the results deterministic:
```
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
```

pip install langdetect

guess_language. Can detect very short samples by using a spell checker with dictionaries.

pip install guess_language-spirit

langid provides both a module

import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)

and a command-line tool:

$ langid < README.md

pip install langid

FastText is a text classifier that can recognize 176 languages when given a suitable language-identification model. Download the model (lid.176.ftz in the snippet below), then:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2))  # top 2 matching languages

# (('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))

pip install fasttext
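
The prediction shown above comes back as a pair of parallel sequences with a `__label__` prefix on each language code. A small sketch of turning that into (code, probability) pairs follows; `parse_predictions` is a hypothetical helper, not part of fasttext, and a plain list stands in for the NumPy array so the example runs without the model:

```python
def parse_predictions(labels, probabilities, prefix="__label__"):
    """Pair each language code (prefix stripped) with its probability."""
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(prob))
        for label, prob in zip(labels, probabilities)
    ]


# Values copied from the sample output; a list replaces the NumPy array.
labels = ("__label__ar", "__label__fa")
probabilities = [0.98124713, 0.01265871]
print(parse_predictions(labels, probabilities))
```

With a real model, the two elements of the tuple returned by model.predict(...) would be passed in place of the hard-coded values.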

pyCLD3 is a neural network model for language identification. This package contains the inference code and a trained model.

import cld3
cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")

# LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

pip install pycld3
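
Since each package above has its own call shape and may or may not be installed, one option is a thin wrapper that tries them in turn. This is only a sketch under assumptions: `first_available`, `_langdetect` and `_langid` are hypothetical names, only the langdetect and langid calls already shown above are used, and any failure (missing package or detection error) simply moves on to the next detector:

```python
from typing import Callable, Iterable, Optional


def _langdetect(text: str) -> str:
    from langdetect import detect  # imported lazily; may not be installed
    return detect(text)


def _langid(text: str) -> str:
    import langid  # imported lazily; may not be installed
    return langid.classify(text)[0]  # classify returns (code, score)


def first_available(
    text: str,
    detectors: Iterable[Callable[[str], str]] = (_langdetect, _langid),
) -> Optional[str]:
    """Return the result of the first detector that runs, or None."""
    for detector in detectors:
        try:
            return detector(text)
        except Exception:  # missing package or detection failure: try the next
            continue
    return None


# Prints a language code if one of the packages is installed, otherwise None.
print(first_available("Ein, zwei, drei, vier"))
```

Passing the detectors as plain callables keeps the wrapper independent of any one library, so other packages from the list can be added the same way.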

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de

langdetect can fail when it is used in parallel code. spacy_langdetect is a wrapper around it that you can use for that purpose, for example:

import spacy
from spacy_langdetect import LanguageDetector

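# spaCy 2.x API: spaCy 3 needs a full model name such as "en_core_web_sm"
# and a component registered as a factory before add_pipe.
nlp = spacy.load("en")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)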
text = "This is English text Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque. Je m'appelle Angélica Summer, j'ai 12 ans et je suis canadienne."
doc = nlp(text)
# Document-level language detection: think of it as the document's average language.
print(doc._.language['language'])
# Sentence-level language detection.
for i, sent in enumerate(doc.sents):
    print(sent, sent._.language)
