Detect non-ASCII characters in a string

A bit late, I guess, but it could be useful for future readers.

You can find these functions:

  • showNonASCII(<character_vector>)
  • showNonASCIIfile(<file>)

in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show non-ASCII characters in a string or in a text file.
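For example, here is a minimal usage sketch (the sample vector is my own illustration, not from the linked help page):

# a small character vector with two non-ASCII characters
txt <- c("façile", "plain ASCII", "an ümlaut")

# prints the elements that contain non-ASCII characters (with the offending
# bytes escaped) and invisibly returns those elements
tools::showNonASCII(txt)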


Why don't you extract the relevant code from showNonASCII?

x <- c("façile test of showNonASCII(): details{", 
       "This is a good line", "This has an ümlaut in it.", "OK again. }")

grepNonASCII <- function(x) {
  # convert from latin1 to ASCII; elements containing characters with no
  # ASCII equivalent come back as NA (or differ from the original)
  asc <- iconv(x, "latin1", "ASCII")
  ind <- is.na(asc) | asc != x
  # return the indices of elements that contain non-ASCII characters
  which(ind)
}

grepNonASCII(x)
#[1] 1 3

I came across this later; it uses a pure base-R regex and is very simple:

grepl("[^ -~]", x)
## [1]  TRUE FALSE  TRUE FALSE

More here: http://www.catonmat.net/blog/my-favorite-regex/
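If you also want to see which characters triggered the match, the same pattern can be reused with gregexpr()/regmatches(); a small sketch, assuming the x defined above:

# extract the actual non-ASCII characters rather than just flagging the elements
unlist(regmatches(x, gregexpr("[^ -~]", x)))
## [1] "ç" "ü"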


Another possible way is to convert your string to ASCII and then detect the non-printable control characters that are generated for everything that couldn't be converted:

grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1]  TRUE FALSE  TRUE FALSE
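This works because stri_enc_toascii() replaces anything it cannot represent with the ASCII SUBSTITUTE control character (\x1a), which [[:cntrl:]] then matches. Inspecting the intermediate result makes that visible:

# non-convertible characters show up as the control byte "\032"
stringi::stri_enc_toascii(x)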

Though it seems stringi has a built-in function for this type of thing too:

stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII"  "latin1" "ASCII" 
