How to trim white spaces when trimws is not working?
The character with code point 160 (U+00A0) is called a "non-breaking space." It is not an ASCII character, since ASCII only covers codes 0 to 127. One can read about it on Wikipedia:
https://en.wikipedia.org/wiki/Non-breaking_space
The trimws() function does not include it in the list of characters that it removes:
x <- intToUtf8(c(160,49,49,57,57,46,48,48))
x
#[1] " 1199.00"
trimws(x)
#[1] " 1199.00"
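When a leading blank survives trimws(), it can help to check which character it really is. The following is a minimal sketch using base R's utf8ToInt(); the variable names code_points and starts_with_nbsp are only illustrative:

```r
# Build the example string: a non-breaking space (code point 160) plus "1199.00"
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# Decimal code points of every character in the string
code_points <- utf8ToInt(x)
code_points

# TRUE when the first character is U+00A0 rather than an ordinary space (32)
starts_with_nbsp <- code_points[1] == 160L
starts_with_nbsp
```

An ordinary space would show up as 32 in the first position, so a 160 there points to the non-breaking space.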
One way to get rid of it is to use the str_trim() function from the stringr package:
library(stringr)
y <- str_trim(x)
y
#[1] "1199.00"
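Another way is to apply the iconv() function first. Here it transliterates the non-breaking space into an ordinary space, which trimws() can then remove (the result of //TRANSLIT can vary by platform):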
y <- iconv(x, from = 'UTF-8', to = 'ASCII//TRANSLIT')
trimws(y)
#[1] "1199.00"
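If neither stringr nor a transliteration step is desirable, a plain regular expression can probably do the job in base R as well. This is a rough sketch rather than one of the methods above: trim_wide() is a hypothetical helper name, and it relies on the PCRE classes \h and \v (horizontal and vertical whitespace), which include the non-breaking space:

```r
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# Hypothetical helper: strip leading and trailing runs of horizontal (\h)
# or vertical (\v) whitespace, as defined by PCRE.
trim_wide <- function(s) {
  gsub("^[\\h\\v]+|[\\h\\v]+$", "", s, perl = TRUE)
}

trimmed <- trim_wide(x)
trimmed
```

Whitespace inside the string is left alone, because both alternatives in the pattern are anchored to the start or the end.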
UPDATE: Why trimws() does not remove the "invisible" character described above, while stringr::str_trim() does.
Here is what the trimws() help says:
For portability, ‘whitespace’ is taken as the character class [ \t\r\n] (space, horizontal tab, line feed, carriage return)
The help topic for stringr::str_trim() does not itself specify what counts as "white space", but the help for stri_trim_both(), which str_trim() calls, shows:
stri_trim_both(str, pattern = "\\P{Wspace}")
Here \P{Wspace} matches any character that is not Unicode white space, so trimming stops at the first such character. That definition covers a wider range of characters than the class used by trimws(), and it includes the non-breaking space.
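The difference between a narrow and a wide whitespace definition can be checked directly with a regular-expression test. This is only an illustrative sketch: it uses base R's grepl() with perl = TRUE, and the PCRE class [\h\v] stands in for a "wider" definition; it is not stringi's own \P{Wspace} pattern. The names matches_default and matches_wide are made up for the example:

```r
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# Does the string start with a character from trimws()'s default class?
matches_default <- grepl("^[ \t\r\n]", x, perl = TRUE)

# Does it start with any PCRE horizontal or vertical whitespace character?
matches_wide <- grepl("^[\\h\\v]", x, perl = TRUE)

c(default = matches_default, wide = matches_wide)
```

For this string the first test is FALSE and the second is TRUE, which is why the default trimws() call leaves the leading character in place.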
UPDATE 2
As @H1 noted, R version 3.6.0 added an option to specify what to consider a whitespace character. From the trimws() help:
Internally, 'sub(re, "", *, perl = TRUE)', i.e., PCRE library regular expressions are used. For portability, the default 'whitespace' is the character class '[ \t\r\n]' (space, horizontal tab, carriage return, newline). Alternatively, '[\h\v]' is a good (PCRE) generalization to match all Unicode horizontal and vertical white space characters, see also <URL: https://www.pcre.org>.
So if you are using R version 3.6.0 or later, you can simply do:
trimws(x, whitespace = "[\\h\\v]")
#[1] "1199.00"
From R version 3.6.0, trimws() has a whitespace argument that lets you define what is considered whitespace, which in this case includes the no-break space:
# Character class: the no-break space (U+00A0) or any \s whitespace character
trimws(x, whitespace = "[\u00A0\\s]")
#[1] "1199.00"