dplyr filter condition to distinguish between unicode symbol and its unicode representation
Edit:
The function glyphs_match() from the gdtools package is designed for this; however, it didn't quite return the expected result. I'm using Lucida Console as my font and obtain the following output when using glyphs_match(). There is one glyph (row 20) that isn't rendered but for which the function returns TRUE. Perhaps other users can explain why that is the case.
df$glyph_match <- gdtools::glyphs_match(df$Symbol, fontfile = "C:\\WINDOWS\\Fonts\\lucon.TTF")
df
Character Symbol glyph_match
1 \\u0024 $ TRUE
2 \\u00A2 ¢ TRUE
3 \\u00A3 £ TRUE
4 \\u00A4 ¤ TRUE
5 \\u00A5 ¥ TRUE
6 \\u058F <U+058F> FALSE
7 \\u060B <U+060B> FALSE
8 \\u07FE <U+07FE> FALSE
9 \\u07FF <U+07FF> FALSE
10 \\u09F2 <U+09F2> FALSE
11 \\u09F3 <U+09F3> FALSE
12 \\u09FB <U+09FB> FALSE
13 \\u0AF1 <U+0AF1> FALSE
14 \\u0BF9 <U+0BF9> FALSE
15 \\u0E3F <U+0E3F> FALSE
16 \\u17DB <U+17DB> FALSE
17 \\u20A0 <U+20A0> FALSE
18 \\u20A1 ¢ TRUE
19 \\u20A2 <U+20A2> FALSE
20 \\u20A3 <U+20A3> TRUE
Earlier answer - may only work on Windows:
The output varies with your font and system. For example, when I run your code, my output doesn't match what you've provided:
df <- structure(list(Character = c("\\u0024", "\\u00A2", "\\u00A3",
"\\u00A4", "\\u00A5", "\\u058F", "\\u060B", "\\u07FE", "\\u07FF",
"\\u09F2", "\\u09F3", "\\u09FB", "\\u0AF1", "\\u0BF9", "\\u0E3F",
"\\u17DB", "\\u20A0", "\\u20A1", "\\u20A2", "\\u20A3"),
Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff",
"৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠",
"₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")
df
Character Symbol
1 \\u0024 $
2 \\u00A2 ¢
3 \\u00A3 £
4 \\u00A4 ¤
5 \\u00A5 ¥
6 \\u058F <U+058F>
7 \\u060B <U+060B>
8 \\u07FE <U+07FE>
9 \\u07FF <U+07FF>
10 \\u09F2 <U+09F2>
11 \\u09F3 <U+09F3>
12 \\u09FB <U+09FB>
13 \\u0AF1 <U+0AF1>
14 \\u0BF9 <U+0BF9>
15 \\u0E3F <U+0E3F>
16 \\u17DB <U+17DB>
17 \\u20A0 <U+20A0>
18 \\u20A1 ¢
19 \\u20A2 <U+20A2>
20 \\u20A3 <U+20A3>
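The difference above is only in how the console prints the strings; the underlying data are the same on every system. A minimal sketch of how to confirm that, assuming the Character column holds literal backslash-u escape text as in the df above. The helper name decode_escape is hypothetical and not part of the original answer:

```r
# Hypothetical helper (not part of the original answer): turn literal
# "\uXXXX" escape text into the character it names.
decode_escape <- function(x) {
  hex <- sub("^\\\\u", "", x)  # drop the leading backslash-u
  intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
}

escapes <- c("\\u0024", "\\u00A3", "\\u20A3")
decoded <- decode_escape(escapes)

# Inspect code points rather than printed output, so the console font
# and locale do not affect what you see.
vapply(decoded, utf8ToInt, integer(1), USE.NAMES = FALSE)
```

With the data frame above, decode_escape(df$Character) == df$Symbol compares the two columns without relying on how either one is printed.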
One crude way to check whether a glyph is rendered is to capture the printed output and test whether each line is a single character:
nchar(capture.output(cat(df$Symbol, sep = "\n"))) == 1
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] TRUE FALSE FALSE
So the glyphs can be filtered by:
library(dplyr)
df %>%
  filter(nchar(capture.output(cat(Symbol, sep = "\n"))) == 1)
Character Symbol
1 \\u0024 $
2 \\u00A2 ¢
3 \\u00A3 £
4 \\u00A4 ¤
5 \\u00A5 ¥
6 \\u20A1 ¢
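The one-liner above assumes every element prints on exactly one line, which breaks if a string is empty or contains a newline. A slightly more defensive sketch of the same idea checks each element separately. The function name prints_as_single_char is hypothetical, and the result still depends on the console and locale, just like the original check:

```r
# Hypothetical helper (not from the original answer): TRUE where a string
# prints as exactly one character in the current console/locale.
prints_as_single_char <- function(x) {
  vapply(x, function(s) {
    # Capture what cat() would print for this one element.
    printed <- paste(capture.output(cat(s)), collapse = "")
    nchar(printed) == 1
  }, logical(1), USE.NAMES = FALSE)
}

# "$" is one character and "ab" is two on any system; the result for
# "\u20a3" depends on the locale and is deliberately not stated here.
prints_as_single_char(c("$", "ab", "\u20a3", ""))
```

It can be used in the same place as the original condition, e.g. df %>% filter(prints_as_single_char(Symbol)).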
Use as.character.POSIXt to 'render' the symbols and pad them with spaces. Symbols in the "\uxxxx" escape form come out as a single character and all others come out longer, so you can filter by length:
# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 2)
# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) == 1)
If a 'symbol' is a longer string (e.g. "aaaaaaaaaa₣"), the padding increases and the thresholds need to be adjusted to account for it, e.g.
# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 11)
# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) <= 10)