How do I convert unicode codepoints to their character representation?
The question asked for a function to convert a string value representing a Unicode code point (i.e. "+Unnnn" rather than the Java formats of "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode format:
- The introduction of Streams in Java 8.
- The method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".
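To illustrate the difference between the two return types, here is a minimal sketch. The class name CodePointToString and the helper names viaToString and viaToChars are made up for this example; only Character.toString(int) and Character.toChars(int) come from the JDK.

```java
public class CodePointToString {
    // Java 11+: Character.toString(int) accepts any valid code point.
    static String viaToString(int codePoint) {
        return Character.toString(codePoint);
    }

    // Works on older releases too: wrap the char[] from Character.toChars.
    static String viaToChars(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+00E4 fits in one char; U+1F601 needs a surrogate pair (two chars).
        System.out.println(viaToString(0x00E4).length());   // 1
        System.out.println(viaToString(0x1F601).length());  // 2
        System.out.println(viaToString(0x1F601).equals(viaToChars(0x1F601))); // true
    }
}
```

Both helpers should produce the same String; the Java 11 overload just saves the extra new String(...) step.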
Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a sequence of code points in Unicode format to a readable String in a single statement:
import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {
    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";
    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> !s.isEmpty()) // First element returned by split() is a zero-length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) {
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\")");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11)
            .collect(Collectors.joining());
    System.out.println(text); // Prints "Hello World 😁"
}
And this is the output:
run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz")
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)
Notes:
- With this approach there is no longer any need for a specific function to convert a code point in Unicode format. Instead, that logic is spread across multiple intermediate operations in the Stream processing. Of course, the same code could still be used to process just a single code point in Unicode format.
- It's easy to add intermediate operations to perform further validation and processing on the Stream, such as case conversion, removal of emoticons, etc.
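As a rough sketch of the second note, the pipeline below adds two extra intermediate operations: one drops code points in the Unicode Emoticons block and one converts to upper case. The class name UnicodePipeline and the method decodeUpperNoEmoticons are hypothetical, and for brevity this variant assumes every element is valid hex (it omits the NumberFormatException handling). It uses an IntStream rather than a Stream of Integer, which is a choice made for this example.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class UnicodePipeline {
    // Hypothetical helper: decodes "+Unnnn" sequences, removes emoticons,
    // and upper-cases the remaining characters. Requires JDK 11 or later.
    static String decodeUpperNoEmoticons(String data) {
        return Arrays.stream(data.split("\\+U"))
                .filter(s -> !s.isEmpty())               // split() yields a leading empty string
                .mapToInt(s -> Integer.parseInt(s, 16))  // assumes well-formed hex input
                .filter(Character::isValidCodePoint)     // drop values outside the Unicode range
                .filter(cp -> Character.UnicodeBlock.of(cp)
                        != Character.UnicodeBlock.EMOTICONS) // extra step: remove emoticons
                .map(cp -> Character.toUpperCase(cp))    // extra step: case conversion
                .mapToObj(cp -> Character.toString(cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // "Hi" followed by U+1F601, which lies in the Emoticons block.
        System.out.println(decodeUpperNoEmoticons("+U0048+U0069+U1f601")); // HI
    }
}
```

Passing a string with a single code point, such as "+U0048", works the same way, so no separate single-value function is needed.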
Code points are written as hexadecimal numbers prefixed by U+.
So, you can do this:
int codepoint = Integer.parseInt(yourString.substring(2), 16);
char[] ch = Character.toChars(codepoint);
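A self-contained version of those two lines might look like the sketch below. The class name CodePointParser and the method fromNotation are invented for illustration, and the code assumes the input always has a two-character prefix such as "U+" followed by valid hex digits. Character.toChars throws IllegalArgumentException if the value is not a valid code point.

```java
public class CodePointParser {
    // Hypothetical helper: turns "U+nnnn" into the corresponding String.
    static String fromNotation(String notation) {
        // Skip the two-character prefix and parse the rest as hexadecimal.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        // toChars returns one char, or two for supplementary code points.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromNotation("U+0041"));           // A
        System.out.println(fromNotation("U+1F601").length()); // 2
    }
}
```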