4 byte unicode character in Java
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoded form (edit: I was right).
You need to do this:
import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701); // the surrogate pair for U+1F701
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8); // 4 bytes in UTF-8
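If you want a quick sanity check, you can dump those bytes in hex; this is just a small follow-up to the snippet above, reusing its asBytes variable:

System.out.println(asBytes.length); // 4
for (final byte b : asBytes) {
    System.out.printf("%02x ", b & 0xff); // prints f0 9f 9c 81
}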
When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... Code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively), and there is no character literal in Java that lets you enter code points outside the BMP directly.
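If you want to see where those two chars come from, here is a small sketch (assuming Java 7+, which added Character.highSurrogate() and Character.lowSurrogate()):

final int codePoint = 0x1F701;
final char high = Character.highSurrogate(codePoint); // leading surrogate: '\uD83D'
final char low = Character.lowSurrogate(codePoint);   // trailing surrogate: '\uDF01'
System.out.println(Character.isSurrogatePair(high, low)); // true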
Given that a char is, in fact, a UTF-16 code unit, and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" -- or directly as the symbol if your computing environment has support for it.
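As a minimal check (nothing here beyond the standard library), the escaped literal and the toChars() result denote the same string:

final String fromEscapes = "\uD83D\uDF01";
final String fromCodePoint = new String(Character.toChars(0x1F701));
System.out.println(fromEscapes.equals(fromCodePoint)); // true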
See also the CharsetDecoder and CharsetEncoder classes.
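For completeness, here is a sketch of the encoder side (the decoder works the same way in reverse); note that CharsetEncoder.encode() throws the checked CharacterCodingException, so wrap or declare it as needed:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
final ByteBuffer encoded = encoder.encode(CharBuffer.wrap("\uD83D\uDF01"));
System.out.println(encoded.remaining()); // 4 bytes for this one code point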
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
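A short illustration of the difference, reusing the String s built at the top of this answer (codePoints() needs Java 8+):

System.out.println(s.length());                      // 2 -- counts chars (UTF-16 code units)
System.out.println(s.codePointCount(0, s.length())); // 1 -- counts code points
s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp)); // prints U+1F701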
String s = "𩸽";
Technically this is one character. But be careful: s.length() will return 2. Also, Java won't compile String s = '𩸽' (a char literal cannot hold a code point outside the BMP). Java doesn't promise that String.length() returns the exact number of characters; it returns the number of Java chars required to store this string. The real number of characters can be obtained from s.codePointCount(0, s.length()).
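To make that concrete (a minimal sketch; any code point outside the BMP behaves the same way):

System.out.println(s.length());                      // 2 -- Java chars (a surrogate pair)
System.out.println(s.codePointCount(0, s.length())); // 1 -- actual characters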