Detect text language in R
Try http://cran.r-project.org/web/packages/cldr/, which brings Google Chrome's language detection to R.
# Install from the CRAN archive
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile <- "cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs = pkgFile, type = "source", repos = NULL)
unlink(pkgFile)
# or: devtools::install_version("cldr", version = "1.1.0")

# Usage
library(cldr)
demo(cldr)
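Beyond the demo, here is a minimal sketch of direct use, assuming the archived cldr exports detectLanguage() returning a data frame with a detectedLanguage column (check names(res) for the exact columns in your installed version):

res <- detectLanguage(c("To be or not to be, that is the question",
                        "Ser o no ser, esa es la cuestión"))
res$detectedLanguage  # expected something like "ENGLISH" "SPANISH"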
The cldr package in a previous answer is no longer available on CRAN and may be difficult to install. However, Google's (Chromium's) cld libraries are now available in R through other dedicated packages, cld2 and cld3.
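Both packages expose a detect_language() function that returns ISO 639 language codes, so trying them on a single string is straightforward:

library(cld2)
library(cld3)
cld2::detect_language("To be, or not to be, that is the question")
[1] "en"
cld3::detect_language("To be, or not to be, that is the question")
[1] "en"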
After testing with some thousands of tweets in multiple European languages, I can say that among the available options textcat is by far the least reliable. With textcat I also quite frequently get tweets wrongly detected as "middle_frisian", "rumantsch", "sanskrit", or other unusual languages. It may be relatively good with other types of texts, but I think textcat is pretty bad for tweets.

cld2 seems in general to still be better than cld3. If you want a safe way to include only tweets in English, you can run both cld2 and cld3 and keep only the tweets that are recognised as English by both.
Here's an example based on a Twitter search, which usually returns results in many different languages but always includes some tweets in English.
if (!require("pacman")) install.packages("pacman") # for package management
pacman::p_load("tidyverse")
pacman::p_load("textcat")
pacman::p_load("cld2")
pacman::p_load("cld3")
pacman::p_load("rtweet")
punk <- rtweet::search_tweets(q = "punk") %>%
  mutate(textcat = textcat(x = text),
         cld2 = cld2::detect_language(text = text, plain_text = FALSE),
         cld3 = cld3::detect_language(text = text)) %>%
  select(text, textcat, cld2, cld3)
View(punk)
# Only English tweets
punk %>% filter(cld2 == "en" & cld3 == "en")
Finally, I should perhaps add the obvious if this question is specifically related to tweets: Twitter provides its own language detection for tweets via its API, and it seems to be pretty accurate (understandably less so with very short tweets). So if you run rtweet::search_tweets(q = "punk"), you will see that the resulting data.frame includes a "lang" column. If you get your tweets via the API, then you can probably trust Twitter's own detection system more than the alternative solutions suggested above (which remain valid for other texts).
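For example, with the same search as above, a minimal sketch relying on that lang column (assuming your rtweet version returns it, as described):

library(rtweet)
library(dplyr)
punk <- search_tweets(q = "punk")
table(punk$lang)                      # distribution of Twitter's own language tags
punk_en <- filter(punk, lang == "en") # keep only tweets Twitter tagged as English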
The textcat package does this. It can detect 74 'languages' (more properly, language/encoding combinations), and more with other extensions. Details and examples are in this freely available article:

Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., & Feinerer, I. (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52(6), 1-17.
Here's the abstract:
Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
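To make the n-gram idea concrete, here is a toy sketch (not textcat's actual implementation): each language is fingerprinted by its most frequent character n-grams, and a new text is assigned to the language with the closest fingerprint.

# Extract overlapping character trigrams from a string
char_ngrams <- function(x, n = 3) {
  x <- tolower(x)
  substring(x, 1:(nchar(x) - n + 1), n:nchar(x))
}
# The most frequent trigrams form the text's "fingerprint"
sort(table(char_ngrams("this is an english sentence")), decreasing = TRUE)[1:5]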
And here's one of their examples:
library("textcat")
textcat(c(
"This is an English sentence.",
"Das ist ein deutscher Satz.",
"Esta es una frase en espa~nol."))
[1] "english" "german" "spanish"