Detect non-ASCII characters in a string
A bit late, I guess, but it could be useful for future readers.
You can find the functions
showNonASCII(<character_vector>)
showNonASCIIfile(<file>)
in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show the non-ASCII characters in a character vector or in a text file.
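For example, a quick sketch (the exact <hex> escapes printed depend on how your session encodes the strings; non-ASCII bytes are displayed escaped):
tools::showNonASCII(c("This is a good line", "This has an ümlaut in it."))
## 2: This has an <c3><bc>mlaut in it.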
Why don't you extract the relevant code from showNonASCII?
x <- c("façile test of showNonASCII(): details{",
"This is a good line", "This has an ümlaut in it.", "OK again. }")
grepNonASCII <- function(x) {
  # Try to convert to ASCII; elements that cannot be represented become NA
  asc <- iconv(x, "latin1", "ASCII")
  # Flag elements that failed conversion or were altered by it
  ind <- is.na(asc) | asc != x
  which(ind)
}
grepNonASCII(x)
#[1] 1 3
I came across this later; it uses a pure base regex and is about as simple as it gets (the pattern [^ -~] matches any character outside the printable ASCII range, space through tilde):
grepl("[^ -~]", x)
## [1] TRUE FALSE TRUE FALSE
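If you want element indices rather than a logical vector, to match the grepNonASCII() output above, just wrap it in which():
which(grepl("[^ -~]", x))
## [1] 1 3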
More here: http://www.catonmat.net/blog/my-favorite-regex/
Another possible way is to convert your string to ASCII and then detect the non-printable control characters generated for the characters that couldn't be converted:
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1] TRUE FALSE TRUE FALSE
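To see why this works: as far as I recall, stri_enc_toascii() replaces every code point above 127 with the ASCII SUBSTITUTE control character (0x1A), and that substitute is what the [[:cntrl:]] class then matches. Roughly:
stringi::stri_enc_toascii(x[1])
## [1] "fa\032ile test of showNonASCII(): details{"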
Though it seems stringi has a built-in function for this kind of thing too:
stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII" "latin1" "ASCII"