Unicode characters from char codes in JavaScript for char codes > 0xFFFF
The problem is that strings in JavaScript are (mostly) UCS-2 encoded, but a character outside the Basic Multilingual Plane can be represented as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        if ((value & 0xF800) === 0xD800) {
            // Lone surrogate code points (U+D800 to U+DFFF) are not valid input
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            // Supplementary code point: split into a high/low surrogate pair
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}
alert( utf16Encode([0x1D400]) );
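As a quick check (a sketch using standard charCodeAt calls), U+1D400 (MATHEMATICAL BOLD CAPITAL A) should come out of the function above as the surrogate pair D835 DC00:
var bold = utf16Encode([0x1D400]);
console.log(bold.length);                      // 2 (two UTF-16 code units)
console.log(bold.charCodeAt(0).toString(16));  // "d835" (high surrogate)
console.log(bold.charCodeAt(1).toString(16));  // "dc00" (low surrogate)
console.log(bold === "\uD835\uDC00");          // true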
Section 8.4 of the ECMAScript language specification says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplementary code points as pairs of UTF-16 code units.
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code point    UTF-16 code units
U+0041        0041
U+00DF        00DF
U+6771        6771
U+10400       D801 DC00
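As a sketch of the arithmetic behind the last row of that table, the surrogate pair for U+10400 can be computed by hand like this (variable names are just for illustration):
var cp = 0x10400;
var offset = cp - 0x10000;            // 0x0400
var high = 0xD800 + (offset >> 10);   // 0xD801 (high surrogate)
var low  = 0xDC00 + (offset & 0x3FF); // 0xDC00 (low surrogate)
console.log(high.toString(16), low.toString(16)); // d801 dc00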
Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:
String.fromCharCode(0xD801, 0xDC00) === '𐐀' // U+10400 (DESERET CAPITAL LETTER LONG I)
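Note that the resulting string is still two code units long; in an ES2015+ engine, codePointAt can be used to read the code point back, as in this small sketch:
var s = String.fromCharCode(0xD801, 0xDC00);
console.log(s.length);                      // 2 (two UTF-16 code units, one character)
console.log(s.codePointAt(0).toString(16)); // "10400"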
String.fromCodePoint() seems to do the trick as well. See here.
console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));
Output:
𝘢𝘣𝘤𝐀
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
function fixedFromCharCode(codePt) {
    if (codePt > 0xFFFF) {
        // Supplementary code point: return it as a high/low surrogate pair
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        return String.fromCharCode(codePt);
    }
}
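For example (a small usage sketch; the supplementary character is the same U+1D400 used above):
console.log(fixedFromCharCode(0x41));                        // "A" (BMP, one code unit)
console.log(fixedFromCharCode(0x1D400));                     // "𝐀" (two code units)
console.log(fixedFromCharCode(0x1D400) === "\uD835\uDC00");  // true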