How do I convert unicode codepoints to their character representation?

The question asked for a function to convert a string value representing a Unicode code point (i.e. "+Unnnn" rather than the Java formats of "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements that simplify processing a string containing multiple code points in Unicode format:

The introduction of Streams in Java 8.
The method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".
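As a small illustration of the second point, here is a minimal sketch, assuming JDK 11 or later. The class name CodePointToString and the helper fromCodePoint are made up for this example. It also shows that a code point above U+FFFF occupies two chars in the resulting String.

```java
public class CodePointToString {

    // Hypothetical helper: wraps Character.toString(int), which requires JDK >= 11.
    public static String fromCodePoint(int codePoint) {
        return Character.toString(codePoint);
    }

    public static void main(String[] args) {
        String umlaut = fromCodePoint(0x00E4);   // "ä", a single char
        String emoji = fromCodePoint(0x1F601);   // stored as a surrogate pair

        System.out.println(umlaut.length());                          // 1
        System.out.println(emoji.length());                           // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));  // 1
    }
}
```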

Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:

import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {

    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";

    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> ! s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) { 
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\"}");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11 )
            .collect(Collectors.joining());

    System.out.println(text); // Prints "Hello World 😁"
}

And this is the output:

run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz"}
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

With this approach there is no longer any need for a dedicated function to convert a code point in Unicode format. The conversion is instead spread across several intermediate operations in the Stream pipeline. Of course, the same code can still be used to process just a single code point in Unicode format.
It's easy to add intermediate operations to perform further validation and processing on the Stream, such as case conversion, removal of emoticons, etc.
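To illustrate the last note, here is a rough sketch of the same pipeline with two extra intermediate operations: one drops code points in the Unicode "Emoticons" block, the other upper-cases the rest. The class and method names are invented for this example, and for brevity it assumes every element is valid hex, so the NumberFormatException handling shown earlier is left out.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class CodePointPipeline {

    // Hypothetical helper: decodes "+Unnnn" sequences, removes emoticons, upper-cases the rest.
    public static String decodeUpperNoEmoticons(String data) {
        return Arrays.stream(data.split("\\+U"))
                .filter(s -> !s.isEmpty())                  // split() yields a leading empty string
                .map(s -> Integer.parseInt(s, 16))          // assumption: input is always valid hex
                .filter(cp -> Character.isValidCodePoint(cp))
                // Extra step 1: drop anything in the Emoticons block (U+1F600 to U+1F64F).
                .filter(cp -> Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.EMOTICONS)
                // Extra step 2: case conversion on the code point itself.
                .map(cp -> Character.toUpperCase(cp))
                .map(cp -> Character.toString(cp))          // requires JDK >= 11
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(decodeUpperNoEmoticons("+U0048+U0069+U1f601")); // HI
    }
}
```

Because the validity filter runs before Character.UnicodeBlock.of, that call never sees an out-of-range value, which would otherwise throw an IllegalArgumentException.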

Code points are written as hexadecimal numbers prefixed by U+.

So you can do this:

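// Skip the two-character "U+" prefix and parse the remaining hex digits.
int codepoint = Integer.parseInt(yourString.substring(2), 16);
// toChars returns one char for BMP code points, or a surrogate pair otherwise.
char[] ch = Character.toChars(codepoint);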
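For completeness, the two lines above could be wrapped into a small helper along these lines. This is only a sketch: the class name CodePointParser and the method parse are made up, and it assumes the input always starts with a two-character prefix such as "U+".

```java
public class CodePointParser {

    // Hypothetical helper: turns a string such as "U+00E4" into the text it denotes.
    public static String parse(String notation) {
        // Assumption: the first two characters are the prefix; the rest is hexadecimal.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        // Character.toChars throws IllegalArgumentException if the value is not a valid code point.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(parse("U+0041"));            // A
        System.out.println(parse("U+1F601").length());  // 2 (surrogate pair)
    }
}
```

Wrapping the char[] in a String saves the caller from dealing with surrogate pairs directly.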
