Unicode characters from char codes in JavaScript for char codes > 0xFFFF
The problem is that strings in JavaScript are (mostly) UCS-2 encoded, but a character outside the Basic Multilingual Plane can be represented as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        if ((value & 0xF800) === 0xD800) {
            // Lone surrogate code points (U+D800 to U+DFFF) are not valid input
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            // Supplementary code point: split into a high/low surrogate pair
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}
alert( utf16Encode([0x1D400]) );
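As a quick check (a sketch using standard charCodeAt calls), U+1D400 (MATHEMATICAL BOLD CAPITAL A) should come out of the function above as the surrogate pair D835 DC00:
var bold = utf16Encode([0x1D400]);
console.log(bold.length);                      // 2 (two UTF-16 code units)
console.log(bold.charCodeAt(0).toString(16));  // "d835" (high surrogate)
console.log(bold.charCodeAt(1).toString(16));  // "dc00" (low surrogate)
console.log(bold === "\uD835\uDC00");          // true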
Section 8.4 of the ECMAScript language specification says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplementary code points as pairs of UTF-16 code units.
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code point    UTF-16 code units
U+0041        0041
U+00DF        00DF
U+6771        6771
U+10400       D801 DC00
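As a sketch of the arithmetic behind the last row of that table, the surrogate pair for U+10400 can be computed by hand like this (variable names are just for illustration):
var cp = 0x10400;
var offset = cp - 0x10000;            // 0x0400
var high = 0xD800 + (offset >> 10);   // 0xD801 (high surrogate)
var low  = 0xDC00 + (offset & 0x3FF); // 0xDC00 (low surrogate)
console.log(high.toString(16), low.toString(16)); // d801 dc00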
Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:
String.fromCharCode(0xD801, 0xDC00) === '𐐀' // U+10400 (DESERET CAPITAL LETTER LONG I)
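Note that the resulting string is still two code units long; in an ES2015+ engine, codePointAt can be used to read the code point back, as in this small sketch:
var s = String.fromCharCode(0xD801, 0xDC00);
console.log(s.length);                      // 2 (two UTF-16 code units, one character)
console.log(s.codePointAt(0).toString(16)); // "10400"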
String.fromCodePoint() seems to do the trick as well. See here.
console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));
Output:
𝘢𝘣𝘤𝐀
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
function fixedFromCharCode(codePt) {
    if (codePt > 0xFFFF) {
        // Supplementary code point: return it as a high/low surrogate pair
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        return String.fromCharCode(codePt);
    }
}
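For example (a small usage sketch; the supplementary character is the same U+1D400 used above):
console.log(fixedFromCharCode(0x41));                        // "A" (BMP, one code unit)
console.log(fixedFromCharCode(0x1D400));                     // "𝐀" (two code units)
console.log(fixedFromCharCode(0x1D400) === "\uD835\uDC00");  // true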