Detecting language using Stanford NLP
Almost certainly there is no language identification in Stanford CoreNLP at this moment. 'Almost', because nonexistence is much harder to prove.
EDIT: Nevertheless, here is some circumstantial evidence:
- language identification is mentioned neither on the main page, nor on the CoreNLP page, nor in the FAQ (although there is a question 'How do I run CoreNLP on other languages?'), nor in the 2014 paper by CoreNLP's authors;
- tools that combine several NLP libraries, including Stanford CoreNLP, use a different library for language identification, for example DKPro Core ASL; other users discussing language identification and CoreNLP don't mention this capability either;
- the CoreNLP source contains Language classes, but nothing related to language identification; you can check all 84 occurrences of the word 'language' manually here.
Try Tika, TextCat, or the Language Detection Library for Java (its authors report "99% over precision for 53 languages").
In general, quality depends on the size of the input text: if it is long enough (say, at least several words, and not specially chosen to be tricky), precision can be pretty good, about 95%.
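To see why input length matters, here is a minimal sketch of rank-order character n-gram profiling, the general idea behind TextCat-style identifiers. It is a toy under stated assumptions, not the implementation of any library named above: the function names, the two one-sentence training samples, and the parameters `max_n=3` and `top_k=300` are all illustrative choices.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the text's most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Assumption: tiny illustrative samples; real profiles need far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs into the forest",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego corre hacia el bosque",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

With a short input, the document profile contains only a handful of n-grams, so the rank comparison has little to work with. That is consistent with quality dropping on very short texts.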
Stanford CoreNLP doesn't have language ID (at least not yet), see http://nlp.stanford.edu/software/corenlp.shtml
There are many more language detection/identification tools. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:
- a fixed list of languages,
- test sentences of substantial length,
- test sentences that are each in a single language, and
- a skewed proportion of training to testing instances.
Notable language ID tools include:
- TextCat (http://cran.r-project.org/web/packages/textcat/index.html)
- CLD2 (https://code.google.com/p/cld2/)
- LingPipe (http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html)
- LangID (https://github.com/saffsd/langid.py)
- CLD3 (https://github.com/google/cld3)
For a more exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/
Noteworthy language identification shared tasks (with training/testing data) include:
- Native Language ID (NLI 2013)
- Discriminating Similar Languages (DSL 2014)
- TweetID (2015)
Also take a look at:
- Language Identification: The Long and the Short of the Matter
- The Problems of Language Identification within Hugely Multilingual Data Sets
- Selecting and Weighting N-Grams to Identify 1100 Languages
- Indigenous Tweets
- Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text