How to reverse a string that contains complicated emojis?
I took TKoL's idea of using the \u200d
character and used it to attempt to create a smaller script.
Note: Not all compositions use a zero width joiner so it will be buggy with other composition characters.
It uses the traditional for
loop because we skip some iterations in case we find combined emoticons. Within the for
loop there is a while
loop to check if there is a following \u200d
character. As long there is one we add the next 2 characters as well and forward the for
loop with 2 iterations so combined emoticons are not reversed.
To easily use it on any string I made it as a new prototype function on the string object.
String.prototype.reverse = function() {
let textArray = [...this];
let reverseString = "";
for (let i = 0; i < textArray.length; i++) {
let char = textArray[i];
while (textArray[i + 1] === '\u200d') {
char += textArray[i + 1] + textArray[i + 2];
i = i + 2;
}
reverseString = char + reverseString;
}
return reverseString;
}
const text = "Hello worldð©ð¦°ð©ð©ð¦ð¦";
console.log(text.reverse());
//Fun fact, you can chain them to double reverse :)
//console.log(text.reverse().reverse());
Reversing Unicode text is tricky for a lot of reasons.
First, depending on the programming language, strings are represented in different ways, either as a list of bytes, a list of UTF-16 code units (16 bits wide, often called "characters" in the API), or as ucs4 code points (4 bytes wide).
Second, different APIs reflect that inner representation to different degrees. Some work on the abstraction of bytes, some on UTF-16 characters, some on code points. When the representation uses bytes or UTF-16 characters, there are usually parts of the API that give you access to the elements of this representation, as well as parts that perform the necessary logic to get from bytes (via UTF-8) or from UTF-16 characters to the actual code points.
Often, the parts of the API performing that logic and thus giving you access to the code points have been added later, as first there was 7 bit ascii, then a bit later everybody thought 8 bits were enough, using different code pages, and even later that 16 bits were enough for unicode. The notion of code points as integer numbers without a fixed upper limit was historically added as the fourth common character length for logically encoding text.
Using an API that gives you access to the actual code points seems like that's it. But...
Third, there are a lot of modifier code points affecting the next code point or following code points. E.g. there's a diacritic modifier turning a following a into an ä, e to ë, &c. Turn the code points around, and aë becomes eä, made of different letters. There is a direct representation of e.g. ä as its own code point but using the modifier is just as valid.
Fourth, everything is in constant flux. There are also a lot of modifiers among the emoji, as used in the example, and more are added every year. Therefore, if an API gives you access to the information whether a code point is a modifier, the version of the API will determine whether it already knows a specific new modifier.
Unicode provides a hacky trick, though, for when it's only about the visual appearance:
There are writing direction modifiers. In the case of the example, left-to-right writing direction is used. Just add a right-to-left writing direction modifier at the beginning of the text and depending on the version of the API / browser, it will look correctly reversed ð
'\u202e' is called right to left override, it is the strongest version of the right to left marker.
See this explanation by w3.org
const text = 'Hello worldð©ð¦°ð©ð©ð¦ð¦'
console.log('\u202e' + text)
const text = 'Hello worldð©ð¦°ð©ð©ð¦ð¦'
let original = document.getElementById('original')
original.appendChild(document.createTextNode(text))
let result = document.getElementById('result')
result.appendChild(document.createTextNode('\u202e' + text))
body {
font-family: sans-serif
}
<p id="original"></p>
<p id="result"></p>
If you're able to, use the _.split()
function provided by lodash. From version 4.0 onwards, _.split()
is capable of splitting unicode emojis.
Using the native .reverse().join('')
to reverse the 'characters' should work just fine with emojis containing zero-width joiners
function reverse(txt) { return _.split(txt, '').reverse().join(''); }
const text = 'Hello worldð©ð¦°ð©ð©ð¦ð¦';
console.log(reverse(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.20/lodash.min.js" integrity="sha512-90vH1Z83AJY9DmlWa8WkjkV79yfS2n2Oxhsi2dZbIv0nC4E6m5AbH8Nh156kkM7JePmqD6tcZsfad1ueoaovww==" crossorigin="anonymous"></script>