4-byte Unicode character in Java

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoded form (edit: I was right).

You need to do this:

import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);           // the surrogate pair for U+1F701
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8); // the 4 UTF-8 bytes

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it has had to adapt... Code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and a low surrogate respectively). There is no character literal in Java that lets you enter code points outside the BMP directly.
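
For instance, a minimal sketch (reusing the example code point U+1F701) of how a supplementary code point splits into its surrogate pair:

final int codePoint = 0x1F701;
final char high = Character.highSurrogate(codePoint);    // leading (high) surrogate: 0xD83D
final char low = Character.lowSurrogate(codePoint);      // trailing (low) surrogate: 0xDF01
System.out.printf("%04X %04X%n", (int) high, (int) low); // prints D83D DF01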

Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" -- or directly as the symbol if your computing environment has support for it.
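
A quick sketch to check that the two-escape literal really denotes the single code point U+1F701:

final String viaEscapes = "\uD83D\uDF01";
System.out.println(viaEscapes.codePointAt(0) == 0x1F701);                      // true
System.out.println(viaEscapes.equals(new String(Character.toChars(0x1F701)))); // true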

See also the CharsetDecoder and CharsetEncoder classes.
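
For the decoding direction, a rough sketch (assuming the asBytes and s variables from the snippet above, plus imports from java.nio and java.nio.charset, inside a method that declares or handles CharacterCodingException):

// The default decoder reports malformed UTF-8 instead of silently substituting
// the replacement character, unlike new String(bytes, UTF_8).
final CharBuffer decoded = StandardCharsets.UTF_8.newDecoder()
        .decode(ByteBuffer.wrap(asBytes));
System.out.println(decoded.toString().equals(s)); // true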

See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
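
For instance, with the String s built above (a small sketch):

System.out.println(s.codePointCount(0, s.length()));           // 1 code point, even though s.length() == 2
s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp)); // prints U+1F701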


String s = "𩸽";

Technically this is one character, but be careful: s.length() will return 2. Also, Java won't compile String s = '𩸽', because a char literal cannot hold a character outside the BMP. Java doesn't promise that String.length() returns the exact number of characters; it returns just the number of Java chars required to store the string.

The real number of characters can be obtained from s.codePointCount(0, s.length()).
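
A small sketch of the difference:

String s = "𩸽";                                       // one character outside the BMP
System.out.println(s.length());                       // 2 -- Java chars (UTF-16 code units)
System.out.println(s.codePointCount(0, s.length()));  // 1 -- actual characters (code points)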
