Convert Unicode surrogate pair to literal string

It appears that you want to extract the first "atomic" character from the user point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.

You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks then taking the first.

First, define the following extension method:

using System.Collections.Generic;
using System.Globalization;

public static class TextExtensions
{
    public static IEnumerable<string> TextElements(this string s)
    {
        // StringInfo.GetTextElementEnumerator() returns a non-generic TextElementEnumerator
        // (an API dating back to .NET 1.1) that doesn't implement IEnumerable<string>, so wrap it.
        if (s == null)
            yield break;
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            yield return enumerator.GetTextElement();
    }
}

Now, you can do:

// FirstOrDefault() requires "using System.Linq;"
var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";

Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ not H.
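The grouping of combining characters can be checked with a small sketch like the one below. The class name TextElementDemo and the method Split are made up for illustration; only StringInfo.GetTextElementEnumerator() comes from the framework.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class TextElementDemo
{
    // Collects the text elements of s into a list.
    public static List<string> Split(string s)
    {
        var result = new List<string>();
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            result.Add(enumerator.GetTextElement());
        return result;
    }
}
```

Calling TextElementDemo.Split("H\u0302=") should give two elements: the first is "H" plus U+0302 (combining circumflex accent), which is two chars long, and the second is "=".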


In Unicode, you have code points. These are up to 21 bits long. Your character 𝐀, Mathematical Bold Capital A, has a code point of U+1D400.

In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.

In UTF-16, two code units that together encode a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.
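The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration, not something you need in practice, since char.ConvertFromUtf32 already does the same job. The class name SurrogateMath and the method Encode are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SurrogateMath
{
    // Encodes a supplementary code point (U+10000..U+10FFFF) as a
    // two-char UTF-16 string: high surrogate followed by low surrogate.
    public static string Encode(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new string(new[] { high, low });
    }
}
```

For U+1D400 the offset is 0xD400, so the high surrogate is 0xD800 + 0x35 = 0xD835 and the low surrogate is 0xDC00 + 0 = 0xDC00.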

This gets a little tricky in .NET, because a .NET Char represents a single UTF-16 code unit, and a .NET String is a sequence of code units.

So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:

var highUnicodeChar = "𝐀"; // U+1D400
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00

So when you index into the string like that, you get only half of the surrogate pair.

You can use char.IsSurrogatePair to test for a surrogate pair. For instance:

string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
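The same test can be used to walk a whole string one code point at a time. The following is a sketch under the assumption that unpaired surrogates should simply be passed through as their raw value; the class name CodePointWalker and the method CodePoints are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class CodePointWalker
{
    // Yields each code point in s, combining surrogate pairs into one value.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                yield return char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character, or an unpaired surrogate passed through as-is
                yield return s[i];
            }
        }
    }
}
```

For the string "A\uD835\uDC00B" this should yield 0x41, 0x1D400 and 0x42.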

It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations and modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.

To test for a combining character, you can use GetUnicodeCategory (available on both char and CharUnicodeInfo) and check for UnicodeCategory.EnclosingMark, UnicodeCategory.NonSpacingMark, or UnicodeCategory.SpacingCombiningMark.
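A minimal sketch of such a check might look like this. The class name CombiningMarks and the method IsCombiningMark are hypothetical; the framework calls are CharUnicodeInfo.GetUnicodeCategory and the UnicodeCategory enum.

```csharp
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class CombiningMarks
{
    // Returns true if c belongs to one of the three Unicode mark categories.
    public static bool IsCombiningMark(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                return true;
            default:
                return false;
        }
    }
}
```

With this, IsCombiningMark('\u0302') is true and IsCombiningMark('H') is false. Because it takes a single char, this sketch only covers marks in the Basic Multilingual Plane; for marks encoded as surrogate pairs you would need the CharUnicodeInfo.GetUnicodeCategory(string, int) overload instead.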

