Removing non-English text from a corpus in R using tm
Here's a method to remove words with non-ASCII characters before making a corpus:
# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special, satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
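# flag words with non-ASCII characters: iconv() replaces every byte it
# cannot convert with the marker string passed as sub
dat3 <- grepl("NON_ASCII", iconv(dat2, "latin1", "ASCII", sub = "NON_ASCII"), fixed = TRUE)
# subset original vector of words to exclude words with non-ASCII characters
# (a logical index also works when no word is flagged, unlike dat2[-integer(0)])
dat4 <- dat2[!dat3]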
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
library(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)
Output (from an older tm release; newer versions print a different summary):
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Special, Happy, Sad, Potential
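If you need this more than once, the steps above can be wrapped in a small function. The following is only a sketch in base R: the names `is_ascii_word` and `drop_non_ascii_words` are made up for illustration and are not part of tm. Instead of `iconv()`, it checks the raw bytes of each word, on the assumption that any non-ASCII character has at least one byte with a value of 128 or higher, which holds for both UTF-8 and latin1 input.

```r
# Hypothetical helper: TRUE for each word made up only of ASCII bytes.
is_ascii_word <- function(words) {
  vapply(words,
         function(w) all(as.integer(charToRaw(w)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: split on sep, drop non-ASCII words, rejoin.
drop_non_ascii_words <- function(text, sep = ", ") {
  words <- unlist(strsplit(text, split = sep, fixed = TRUE))
  paste(words[is_ascii_word(words)], collapse = sep)
}

dat <- "Special, satisfação, Happy, Sad, Potential, für"
cleaned <- drop_non_ascii_words(dat)
print(cleaned)

# input without any non-ASCII words is returned unchanged
all_ascii <- drop_non_ascii_words("Happy, Sad")
print(all_ascii)
```

The result in `cleaned` can then be passed to `Corpus(VectorSource(cleaned))` as before.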
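You can also use the stringi package. Note that this approach transliterates accented characters to their closest ASCII equivalents instead of removing the words that contain them.
Using the above example: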
library(stringi)
dat <- "Special, satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")
Output:
[1] "Special, satisfacao, Happy, Sad, Potential, fur"
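If you would rather drop those words than transliterate them, stringi can do that as well. This is a minimal sketch that assumes the words are separated by ", " as in the example; it uses `stri_split_fixed()` to split the string and `stri_enc_isascii()` to test each word.

```r
library(stringi)

dat <- "Special, satisfação, Happy, Sad, Potential, für"

# split the string into words on the literal separator ", "
words <- stri_split_fixed(dat, ", ")[[1]]

# keep only the words that consist entirely of ASCII characters
kept <- words[stri_enc_isascii(words)]

# convert the vector back to a single string
result <- paste(kept, collapse = ", ")
print(result)
```

Which of the two you want depends on the data: transliteration keeps every word but changes its spelling, while filtering removes the word altogether.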