What is a "surrogate pair" in Java?
Adding some more info to the above answers from this post.
Tested on Java 12; it should work on Java 5 and later, where the code point APIs were introduced.
As mentioned here: https://stackoverflow.com/a/47505451/2987755,
any character whose Unicode code point is above U+FFFF is represented as a surrogate pair, which Java stores as a pair of char values, i.e. the single Unicode character is represented as two adjacent Java chars.
The following examples show this.
1. Length:
"🌉".length() // 2, although one might expect 1
"🌉".codePointCount(0, "🌉".length()) // 1, the number of Unicode code points in the String
2. Equality:
Write "🌉" as a String using the Unicode escapes \ud83c\udf09, as below, and check equality.
"🌉".equals("\ud83c\udf09") // true
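Java string literals have no UTF-32 style escape: \u takes exactly four hex digits, so "\u1F309" is parsed as U+1F30 followed by the character '9'.
"🌉".equals("\u1F309") // false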
3. You can convert a Unicode code point to a Java String
"🌉".equals(new String(Character.toChars(0x0001F309))) // true
4. String.substring() indexes by char, not by code point, so it can split a surrogate pair
"🌉🌉".substring(0, 1) // "?" (an unpaired high surrogate)
"🌉🌉".substring(0, 2) // "🌉"
"🌉🌉".substring(0, 4) // "🌉🌉"
To solve this we can use String.offsetByCodePoints(int index, int codePointOffset):
"🌉🌉".substring(0, "🌉🌉".offsetByCodePoints(0, 1)) // "🌉"
"🌉🌉".substring(2, "🌉🌉".offsetByCodePoints(2, 1)) // "🌉"
5. Iterate over a Unicode string with java.text.BreakIterator
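A minimal sketch of what such an iteration might look like. The class name BreakIteratorSketch and the helper characters() are made up for illustration; only BreakIterator itself comes from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BreakIteratorSketch {
    // Hypothetical helper: splits text at character boundaries,
    // so a surrogate pair stays together as one element.
    static List<String> characters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83C\uDF09b"; // "a", U+1F309, "b"
        System.out.println(s.length());           // 4 (char count)
        System.out.println(characters(s).size()); // 3 (character count)
    }
}
```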
6. Sort strings containing Unicode text with java.text.Collator
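As a rough illustration, the sketch below compares natural String ordering (by char value) with a locale-aware Collator. The class name CollatorSketch, the helper sortForLocale() and the sample words are assumptions made for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorSketch {
    // Hypothetical helper: returns a copy sorted by the locale's collation rules.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00E9cole", "apple"); // "école"

        List<String> natural = new ArrayList<>(words);
        Collections.sort(natural); // compares char values, so U+00E9 sorts after 'z'
        System.out.println(natural);

        // The French collator orders "école" between "apple" and "zebra".
        System.out.println(sortForLocale(words, Locale.FRENCH));
    }
}
```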
7. Character's toUpperCase() and toLowerCase() methods should not be used; instead, use String's toUpperCase() and toLowerCase() with a particular locale.
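The difference can be sketched as follows. The class name CaseMappingSketch and the helper upper() are hypothetical, and the German sharp s and the Turkish dotted I are just two convenient test cases.

```java
import java.util.Locale;

class CaseMappingSketch {
    // Hypothetical helper wrapping the locale-aware String method.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // A char-to-char mapping cannot expand one char into two,
        // so U+00DF (sharp s) comes back unchanged.
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF'); // true

        // The String method applies the full mapping.
        System.out.println(upper("\u00DF", Locale.ROOT)); // SS

        // The result also depends on the locale: Turkish maps "i" to U+0130.
        System.out.println(upper("i", Locale.ROOT).equals("I"));                       // true
        System.out.println(upper("i", Locale.forLanguageTag("tr")).equals("\u0130")); // true
    }
}
```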
8. Character.isLetter(char ch) does not support supplementary characters; prefer Character.isLetter(int codePoint). Most methodName(char ch) methods in the Character class have a methodName(int codePoint) counterpart that can handle supplementary characters.
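A small sketch of the two overloads side by side, using U+10400 (DESERET CAPITAL LETTER LONG I) as an example of a supplementary letter. The class and method names are made up for illustration.

```java
class CodePointSketch {
    // Counts letters char by char; each half of a surrogate pair is tested alone.
    static int countLettersByChar(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) n++;
        }
        return n;
    }

    // Counts letters code point by code point (String.codePoints() needs Java 8+).
    static long countLettersByCodePoint(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        String deseret = new String(Character.toChars(0x10400)); // "\uD801\uDC00"
        System.out.println(countLettersByChar(deseret));      // 0
        System.out.println(countLettersByCodePoint(deseret)); // 1
    }
}
```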
9. Specify the charset explicitly in String.getBytes(), when converting from bytes to String, and when constructing InputStreamReader and OutputStreamWriter.
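For instance, a round trip through UTF-8 with the charset named on both sides might look like the sketch below. CharsetSketch and its helpers are hypothetical names; the byte stream stands in for a file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CharsetSketch {
    // Encodes with an explicit charset instead of the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes through an InputStreamReader that is told which charset to use.
    static String readUtf8(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c); // read() returns one UTF-16 code unit at a time
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String bridge = "\uD83C\uDF09"; // U+1F309
        byte[] utf8 = toUtf8(bridge);
        System.out.println(utf8.length);                    // 4 bytes in UTF-8
        System.out.println(readUtf8(utf8).equals(bridge));  // true
    }
}
```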
Ref:
https://coolsymbol.com/emojis/emoji-for-copy-and-paste.html#objects
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
https://www.ibm.com/developerworks/library/j-unicode/index.html
https://www.oracle.com/technetwork/articles/javaee/supplementary-142654.html
Other terms worth exploring: Normalization, BiDi
Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values, called code points, were adopted for the UTF-32 encoding scheme. But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates (in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates (in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate, together called a surrogate pair, to represent the 1,048,576 (0x100000, the product of 1,024 and 1,024) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF).
The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units fall into two ranges known as "high surrogates" (0xD800 to 0xDBFF) and "low surrogates" (0xDC00 to 0xDFFF), depending on whether they are allowed at the start or end of the two-code-unit sequence.
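The arithmetic behind the pairing can be sketched by hand and checked against the JDK's own Character.toChars() and Character.toCodePoint(). The class SurrogateMathSketch and its two methods are illustrative names, not part of any library.

```java
import java.util.Arrays;

class SurrogateMathSketch {
    // Splits a supplementary code point into {high, low} surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not in 0x10000..0x10FFFF");
        }
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Reassembles the code point from a valid high/low surrogate pair.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F309);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83C DF09
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F309)));    // true
        System.out.println(toCodePoint(pair[0], pair[1]) == 0x1F309);           // true
    }
}
```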