What characters are grouped with Array.from?
UTF-16 (the encoding JavaScript uses for strings) works in 16-bit code units. Every Unicode code point that fits in 16 bits is represented as a single code unit; everything else takes two code units, known as a surrogate pair. The string iterator iterates over code points, not code units.
UTF-16 on Wikipedia
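A minimal sketch of the difference between code units and code points. The variable names and sample characters are only illustrative:

```javascript
// 'A' (U+0041) fits in one 16-bit code unit; U+1F44D (thumbs up) does not.
const bmp = 'A';
const astral = '\u{1F44D}';

console.log(bmp.length);                // 1 code unit
console.log(astral.length);             // 2 code units (a surrogate pair)
console.log(Array.from(astral).length); // 1 code point
```

`length` counts code units, while `Array.from` goes through the string iterator and therefore counts code points.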
It comes down to the code units behind the characters. Some visible characters are encoded as two separate UTF-16 code points, and Array.from returns those as two elements. Check the character lists:
http://www.fileformat.info/info/charset/UTF-8/list.htm
http://www.fileformat.info/info/charset/UTF-16/list.htm
function displayHexUnicode(s) {
  // Print each UTF-16 code unit of s as four hex digits.
  console.log(s.split("").reduce((hex, c) => hex + c.charCodeAt(0).toString(16).padStart(4, "0"), ""));
}
displayHexUnicode('षि'); // 0937093f
Array.from('षि').forEach(x => displayHexUnicode(x)); // 0937, then 093f
displayHexUnicode('👍'); // d83ddc4d
Array.from('👍').forEach(x => displayHexUnicode(x)); // d83ddc4d
The function that displays the hex code comes from:
Javascript: Unicode string to hex
Array.from first tries to invoke the iterator of the argument if it has one. Strings do have iterators, so it invokes String.prototype[Symbol.iterator]. Let's look up how that prototype method works. It's described in the specification here:
- Let O be ? RequireObjectCoercible(this value).
- Let S be ? ToString(O).
- Return CreateStringIterator(S).
Looking up CreateStringIterator eventually takes you to 21.1.5.2.1 %StringIteratorPrototype%.next ( ), which does:
- Let cp be ! CodePointAt(s, position).
- Let resultString be the String value containing cp.[[CodeUnitCount]] consecutive code units from s beginning with the code unit at index position.
- Set O.[[StringNextIndex]] to position + cp.[[CodeUnitCount]].
- Return CreateIterResultObject(resultString, false).
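The effect of these steps can be observed by stepping through a string iterator by hand. This is only a sketch; the sample string and variable names are made up for illustration:

```javascript
// 'a' followed by U+1F44D: three code units in total.
const s = 'a\u{1F44D}';
const it = s[Symbol.iterator]();

const first = it.next();  // value is 'a' (one code unit)
const second = it.next(); // value holds both code units of the surrogate pair
const third = it.next();  // iteration is finished

console.log(first.value.length, second.value.length, third.done); // 1 2 true
```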
The CodeUnitCount is what you're interested in. This number comes from CodePointAt:
- Let first be the code unit at index position within string.
- Let cp be the code point whose numeric value is that of first.
- If first is not a leading surrogate or trailing surrogate, then return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.
- If first is a trailing surrogate or position + 1 = size, then return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Let second be the code unit at index position + 1 within string.
- If second is not a trailing surrogate, then return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Set cp to ! UTF16DecodeSurrogatePair(first, second).
- Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.
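As a rough illustration, the steps above can be mirrored in plain JavaScript. The function codePointAtRecord and its field names are hypothetical, not part of any API; real code would simply call String.prototype.codePointAt:

```javascript
// Hypothetical helper mirroring the abstract CodePointAt operation.
function codePointAtRecord(string, position) {
  const size = string.length;
  const first = string.charCodeAt(position);
  const isLeading = (u) => u >= 0xd800 && u <= 0xdbff;
  const isTrailing = (u) => u >= 0xdc00 && u <= 0xdfff;

  // Ordinary code unit: one unit, not a surrogate at all.
  if (!isLeading(first) && !isTrailing(first)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  // Trailing surrogate first, or a leading surrogate at the end of the string.
  if (isTrailing(first) || position + 1 === size) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = string.charCodeAt(position + 1);
  // Leading surrogate not followed by a trailing surrogate.
  if (!isTrailing(second)) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Equivalent of UTF16DecodeSurrogatePair(first, second).
  const cp = (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}

console.log(codePointAtRecord('\uD83D\uDC4D', 0).codeUnitCount); // 2
console.log(codePointAtRecord('\u0937\u093F', 0).codeUnitCount); // 1
```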
So, when iterating over a string with Array.from, the CodeUnitCount is 2 only when the code unit at the current position is a leading surrogate that is immediately followed by a trailing surrogate. The code units that are treated as surrogates are described here:
Such operations apply special treatment to every code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF (defined by the Unicode Standard as a leading surrogate, or more formally as a high-surrogate code unit) and every code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF (defined as a trailing surrogate, or more formally as a low-surrogate code unit) using the following rules..:
षि is not a surrogate pair:
console.log('षि'.charCodeAt()); // First character code: 2359, or 0x937
console.log('षि'.charCodeAt(1)); // Second character code: 2367, or 0x93F
But 👍's code units are:
console.log('👍'.charCodeAt()); // 55357, or 0xD83D
console.log('👍'.charCodeAt(1)); // 56397, or 0xDC4D
The first character code of '👍' is, in hex, D83D, which is within the range 0xD800 to 0xDBFF of leading surrogates. In contrast, the first character code of 'षि' is much lower and falls outside that range. So 'षि' gets split apart, but '👍' doesn't.
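The same check can be written out directly. This is a small sketch; isLeadingSurrogate is a made-up helper name, and both strings are spelled with escapes so the code units are explicit:

```javascript
// Hypothetical helper: is this code unit in the leading-surrogate range?
const isLeadingSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;

const devanagari = '\u0937\u093F'; // Ssa + Vowel Sign I
const thumbsUp = '\uD83D\uDC4D';   // surrogate pair for U+1F44D

console.log(isLeadingSurrogate(devanagari.charCodeAt(0))); // false
console.log(isLeadingSurrogate(thumbsUp.charCodeAt(0)));   // true
console.log(Array.from(devanagari).length); // 2
console.log(Array.from(thumbsUp).length);   // 1
```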
षि is composed of two separate characters: ष, Devanagari Letter Ssa, and ि, Devanagari Vowel Sign I. When they appear next to each other in this order, they are rendered as a single combined glyph, even though they remain two separate code points.
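To see the two code points individually, something like the following could be used. It is a sketch; the variable names are illustrative, and the string is written with escapes so its contents are unambiguous:

```javascript
const syllable = '\u0937\u093F'; // Devanagari Letter Ssa + Vowel Sign I
const parts = Array.from(syllable);

// Format each code point as U+XXXX.
const labels = parts.map(
  (c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);

console.log(labels);                       // [ 'U+0937', 'U+093F' ]
console.log(parts.join('') === syllable);  // true
```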
In contrast, the two code units of 👍 only make sense together, as a single code point. If you try to use a string containing either code unit without the other, you'll get a nonsense symbol:
console.log('👍'[0]);
console.log('👍'[1]);
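Since lone surrogates do not print usefully, it can help to inspect their numeric values instead. A sketch, with illustrative variable names:

```javascript
const thumbsUp = '\uD83D\uDC4D';
const high = thumbsUp[0]; // lone leading surrogate
const low = thumbsUp[1];  // lone trailing surrogate

console.log(high.codePointAt(0).toString(16));     // d83d
console.log(low.codePointAt(0).toString(16));      // dc4d
console.log(high + low === thumbsUp);              // true
console.log(thumbsUp.codePointAt(0).toString(16)); // 1f44d
```

Only when both halves are adjacent does codePointAt report the full code point U+1F44D.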