Why does Oracle use a different byte length than Java for the supplementary Unicode character chipmunk?

What follows is my speculation.

Java Strings are internally represented in UTF-16. When you call getBytes("UTF-8"), Java converts between the two encodings, and you are presumably running an up-to-date Java platform.
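For illustration, here is roughly what the client side sees. The string literal is an assumed reconstruction of the test data: U+1F43F CHIPMUNK followed by U+FE0F VARIATION SELECTOR-16, which is what a 7-byte UTF-8 result implies.

```java
import java.nio.charset.StandardCharsets;

public class ChipmunkEncoding {
    public static void main(String[] args) {
        // Assumed test data: surrogate pair for U+1F43F plus U+FE0F.
        String chipmunk = "\uD83D\uDC3F\uFE0F";

        // Internally Java stores UTF-16 code units: 2 surrogates + 1 BMP char = 3.
        System.out.println("UTF-16 code units: " + chipmunk.length());   // 3

        // Converting to UTF-8: 4 bytes for U+1F43F + 3 bytes for U+FE0F = 7 bytes.
        byte[] utf8 = chipmunk.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:       " + utf8.length);         // 7
    }
}
```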

When you attempt to store a Java String in the database, Oracle likewise converts between Java's native UTF-16 and the database character set, as determined by NLS_CHARACTERSET.

The chipmunk character was approved as part of the Unicode standard in 2014 (according to the page you linked), while the latest release of Oracle 11g Release 2 dates from 2013.

One might assume that Oracle uses a different or outdated character conversion algorithm, so the byte representation of the chipmunk character on the server (9 bytes) differs from what getBytes() returns on the client (7 bytes).

I guess to resolve this issue you could upgrade your Oracle server or use UTF-16 as the database character set.


The issue is Oracle's handling of supplementary Unicode characters when NLS_CHARACTERSET is UTF8.

From the documentation:

The UTF8 character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms.

Supplementary characters inserted into a UTF8 database do not corrupt the data in the database. A supplementary character is treated as two separate, user-defined characters that occupy 6 bytes. Oracle recommends that you switch to AL32UTF8 for full support of supplementary characters in the database character set.
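In other words, Oracle's legacy UTF8 character set behaves essentially like CESU-8: every UTF-16 code unit, including both halves of a surrogate pair, is encoded as its own sequence of up to 3 bytes, so a supplementary character takes 6 bytes instead of 4. The sketch below is not Oracle's implementation, just a way to reproduce the 9-byte figure; the string literal is again an assumed reconstruction of the test data.

```java
public class Cesu8Length {
    // Count bytes the CESU-8 way: each UTF-16 code unit (char) is encoded on
    // its own, so a surrogate pair becomes 2 x 3 bytes, not one 4-byte sequence.
    static int cesu8Length(String s) {
        int bytes = 0;
        for (char c : s.toCharArray()) {
            if (c < 0x80)       bytes += 1;  // ASCII
            else if (c < 0x800) bytes += 2;  // 2-byte range
            else                bytes += 3;  // everything else, including surrogates
        }
        return bytes;
    }

    public static void main(String[] args) {
        String chipmunk = "\uD83D\uDC3F\uFE0F";     // assumed U+1F43F + U+FE0F
        System.out.println(cesu8Length(chipmunk));  // 9: 3 + 3 (surrogates) + 3 (U+FE0F)
        // Real UTF-8 (AL32UTF8) would need only 4 + 3 = 7 bytes.
    }
}
```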

Additionally, the last code point in the chipmunk string is a variation selector (U+FE0F) and is optional. I verified this using a Unicode character inspector.
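If you don't have an inspector at hand, the same check can be run from jshell (again assuming the chipmunk-plus-variation-selector string):

```java
// Print each code point of the assumed test string.
"\uD83D\uDC3F\uFE0F".codePoints()
        .forEach(cp -> System.out.printf("U+%04X%n", cp));
// Output:
// U+1F43F   <- CHIPMUNK
// U+FE0F    <- VARIATION SELECTOR-16 (requests emoji presentation; optional)
```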

After changing the database's NLS_CHARACTERSET parameter to AL32UTF8, the test passed.
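If you want the test to guard against this configuration in the future, one option is a quick JDBC check of the character set before the assertions run. The connection URL and credentials below are placeholders; the query against nls_database_parameters is standard Oracle.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CharsetCheck {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:oracle:thin:@//dbhost:1521/service"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT value FROM nls_database_parameters "
                     + "WHERE parameter = 'NLS_CHARACTERSET'");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                // Expect AL32UTF8 for full supplementary-character support.
                System.out.println("NLS_CHARACTERSET = " + rs.getString(1));
            }
        }
    }
}
```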