How to write Unicode string to text file in R Windows?
I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. I think that should avoid any unwanted character set conversions in cat().
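As a minimal sketch of that idea, the helper below drops the encoding mark from a copy of the string and then writes it with cat(). The name write_bytes_as_is and the use of tempfile() are my own choices for illustration, not part of the original question.

```r
# Hypothetical helper: write a string's bytes to a file without
# giving cat() a declared encoding to convert from.
write_bytes_as_is <- function(x, file) {
  y <- x                    # work on a copy of the string
  Encoding(y) <- "unknown"  # drop the encoding mark; the bytes stay the same
  cat(y, file = file)
}

s <- "\xe1\xbb\x8f"         # UTF-8 bytes of U+1ECF
Encoding(s) <- "UTF-8"
out <- tempfile()           # illustrative output path
write_bytes_as_is(s, out)
print(readBin(out, what = "raw", n = 10))  # inspect the bytes that were written
```

The final readBin() call is only there to look at what ended up in the file; the bytes should match charToRaw(s).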
Here is an expanded example to demonstrate what I think happens in the original example:
print_info <- function(x) {
    print(x)
    print(Encoding(x))
    str(x)
    print(charToRaw(x))
}
cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")
cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")
cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")
cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")
In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f, which is "ỏ").
In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.
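To check which files came out right on a given system, one option is to read them back as raw bytes and compare against the expected sequence. This is only a sketch: has_expected_bytes is a hypothetical helper, and the loop assumes the four files from the example were written to the current working directory.

```r
# Expected content: the UTF-8 encoding of U+1ECF
expected <- as.raw(c(0xe1, 0xbb, 0x8f))

# Hypothetical helper: TRUE if the file holds exactly the expected bytes
has_expected_bytes <- function(path, expected) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  identical(bytes, expected)
}

for (f in c("no-iconv", "yes-iconv", "latin", "unknown")) {
  if (file.exists(f)) {     # skip files that were not written here
    cat(f, ":", has_expected_bytes(f, expected), "\n")
  }
}
```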
The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:
(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f
(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f
(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
chr "á»"
[1] e1 bb 8f
(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
chr "á»"
[1] e1 bb 8f
In the original example, iconv() uses the default from = "" argument, i.e. conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
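The round trip can also be checked directly, without writing any files. The sketch below spells out "latin1" explicitly instead of relying on from = "", so it should not depend on the current locale; the variable names are mine.

```r
s <- "\xe1\xbb\x8f"
Encoding(s) <- "UTF-8"
orig <- charToRaw(s)

# Step (2): treat the UTF-8 bytes as if they were latin1
distorted <- iconv(s, from = "latin1", to = "UTF-8")
# Step (3): convert back to latin1, as cat() presumably does
restored <- iconv(distorted, from = "UTF-8", to = "latin1")

print(charToRaw(distorted))
print(charToRaw(restored))
print(identical(charToRaw(restored), orig))  # TRUE if the distortion is undone
```

If the last line prints TRUE, the two conversions cancel out at the byte level, which is why the original approach ends up producing the right file in a Latin-1 locale.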