Unicode string with diacritics split by chars

A little update on this.

Since ES6 arrived, there are new string methods and new ways of dealing with strings. They offer solutions to the two problems present here.

1) Emoji and surrogate pairs

Emoji and other Unicode characters that fall above the Basic Multilingual Plane (the BMP covers Unicode "code points" in the range 0x0000 - 0xFFFF) are stored as surrogate pairs, that is, two UTF-16 code units each. Strings in ES6 adhere to the iterator protocol and iterate by code point rather than by code unit, so you can handle them like this:

let textWithEmoji = '\ud83d\udc0e\ud83d\udc71\u2764'; // horse, person with blond hair and heart
[...textWithEmoji].length // 3 (textWithEmoji.length is 5)
for (const char of textWithEmoji) { console.log(char) } // will log 3 chars

2) Diacritics

A harder problem to solve, as you start to work with "grapheme clusters" (a character and its diacritics). ES6 has a method that simplifies working with these, String.prototype.normalize, but the problem is still hard. As Mathias Bynens puts it:

(A) code points with multiple combining marks applied to them always result in a single visual glyph, but may not have a normalized form, in which case normalization doesn’t help.
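
The limitation in that quote can be seen in a short sketch. The sample strings are assumptions picked for illustration: 'e' plus a combining acute accent has a precomposed form, while 'q' plus two combining dots does not.

```javascript
// 'e' + COMBINING ACUTE ACCENT (U+0301) has a precomposed form, U+00E9.
const composable = 'e\u0301';
const composed = composable.normalize('NFC');
console.log(composable.length);      // 2
console.log(composed.length);        // 1
console.log(composed === '\u00e9');  // true

// 'q' + COMBINING DOT ABOVE (U+0307) + COMBINING DOT BELOW (U+0323)
// renders as one glyph, but no precomposed code point exists for it,
// so NFC cannot collapse it into a single code point.
const stubborn = 'q\u0307\u0323';
const stillLong = stubborn.normalize('NFC');
console.log(stillLong.length);       // 3
console.log([...stillLong].length);  // 3, iterating by code point does not help either
```

So normalizing first and then splitting by code point only works for clusters that happen to have a precomposed form.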

More insight can be found here:

https://ponyfoo.com/articles/es6-strings-and-unicode-in-depth
https://mathiasbynens.be/notes/javascript-unicode

To do this properly, what you want is the algorithm for working out the grapheme cluster boundaries, as defined in UAX 29. Unfortunately this requires knowledge of which characters are members of which classes, from the Unicode Character Database, and JavaScript doesn't make that information available(*). So you'd have to include a copy of the UCD with your script, which would make it pretty bulky.
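
Newer engines than the ones this was written for may expose UAX 29 segmentation through Intl.Segmenter, in which case no bundled copy of the UCD is needed. A minimal sketch, assuming the runtime provides Intl.Segmenter; the function name splitGraphemes and the sample string are made up for illustration:

```javascript
// Assumption: Intl.Segmenter (ECMA-402) exists in this runtime.
// Older browsers and older Node versions do not have it.
function splitGraphemes(s) {
    // undefined locale means "use the runtime's default locale".
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    // segment() returns an iterable of { segment, index, input } records.
    return Array.from(segmenter.segment(s), (part) => part.segment);
}

// q + two combining dots, a horse emoji, Cyrillic а + combining acute
const clusters = splitGraphemes('q\u0307\u0323\ud83d\udc0e\u0430\u0301');
console.log(clusters.length); // 3
```

Where Intl.Segmenter is missing, the options described here still apply.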

An alternative if you only need to worry about the basic accents used by Latin or Cyrillic would be to take only the Combining Diacritical Marks block (U+0300-U+036F). This would fail for other languages and symbols, but might be enough for what you want to do.

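function findGraphemesNotVeryWell(s) {
    // One base character followed by any number of combining marks
    // from the Combining Diacritical Marks block (U+0300-U+036F).
    var re = /.[\u0300-\u036F]*/g;
    var match, matches = [];
    while ((match = re.exec(s)) !== null)
        matches.push(match[0]);
    return matches;
}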

findGraphemesNotVeryWell('Ааа́Ббб́Ввв́ГгҐґДд');
["А", "а", "а́", "Б", "б", "б́", "В", "в", "в́", "Г", "г", "Ґ", "ґ", "Д", "д"]

(*: there might be a way to extract the information by letting the browser render the string, and measuring the positions of selections in it... but it would surely be very messy and difficult to get working cross-browser.)
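
A possible middle ground between findGraphemesNotVeryWell and a full UAX 29 implementation is to match on the Unicode Mark category instead of a single block. This is only a sketch, assuming an engine that supports the `u` flag with `\p{...}` property escapes (ES2018); the name findGraphemesSomewhatBetter is made up for illustration. It still ignores the other UAX 29 rules, so emoji joined with zero-width joiners, flag sequences and Hangul jamo are split incorrectly.

```javascript
// Assumption: RegExp Unicode property escapes are available (ES2018).
function findGraphemesSomewhatBetter(s) {
    // Either a non-mark code point followed by any number of marks,
    // or a stray run of marks with no base character in front of it.
    // The `u` flag makes the regex work on code points, so surrogate
    // pairs are kept together.
    return s.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

// horse emoji, Cyrillic а + combining acute, Hebrew bet + qamats
const parts = findGraphemesSomewhatBetter('\ud83d\udc0e\u0430\u0301\u05d1\u05b8');
console.log(parts.length); // 3
```

Unlike the block-based version, this handles marks outside U+0300-U+036F (the Hebrew point here is U+05B8) and does not cut astral characters in half.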

