How would you get an array of Unicode code points from a .NET String?
Doesn't seem like it should be much more complicated than this:
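public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
    // Match the machine's byte order so BitConverter.ToInt32 reads each value correctly.
    bool useBigEndian = !BitConverter.IsLittleEndian;
    Encoding utf32 = new UTF32Encoding( useBigEndian , false , true ) ;
    // Encoding.GetBytes has no IEnumerable<char> overload, so materialize first (needs System.Linq).
    byte[] octets = utf32.GetBytes( s.ToArray() ) ;
    for ( int i = 0 ; i < octets.Length ; i+=4 )
    {
        int codePoint = BitConverter.ToInt32(octets,i);
        yield return codePoint;
    }
}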
This answer is not correct. See @Virtlink's answer for the correct one.
static int[] ExtractScalars(string s)
{
    if (!s.IsNormalized())
    {
        s = s.Normalize();
    }
    List<int> chars = new List<int>((s.Length * 3) / 2);
    var ee = StringInfo.GetTextElementEnumerator(s);
    while (ee.MoveNext())
    {
        string e = ee.GetTextElement();
        chars.Add(char.ConvertToUtf32(e, 0));
    }
    return chars.ToArray();
}
Notes: Normalization is required to deal with composite characters.
You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
- The character is from the Basic Multilingual Plane (BMP), and is encoded by a single code unit.
- The character is outside the BMP, and is encoded using a high-low surrogate pair of code units.
Therefore, assuming the string is valid, this returns an array of code points for a given string:
public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");
    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        // Reads one code point, combining a surrogate pair if one starts at i.
        codePoints.Add(Char.ConvertToUtf32(str, i));
        // A surrogate pair occupies two chars, so skip the low surrogate.
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }
    return codePoints.ToArray();
}
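The "valid string" assumption matters because Char.ConvertToUtf32(string, int) throws an ArgumentException when it meets a lone surrogate. As a rough sketch of a more lenient variant, assuming .NET Core 3.0 or later (System.Text.Rune does not exist in .NET Framework), string.EnumerateRunes() reports ill-formed sequences as U+FFFD instead of throwing. The class and method names below are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RuneSketch   // hypothetical name, not from the answers above
{
    // Assumes .NET Core 3.0+; Rune and EnumerateRunes are unavailable in .NET Framework.
    public static int[] ToCodePointsLenient(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        var codePoints = new List<int>(str.Length);
        foreach (Rune rune in str.EnumerateRunes())
        {
            // A lone surrogate is reported as Rune.ReplacementChar (U+FFFD).
            codePoints.Add(rune.Value);
        }
        return codePoints.ToArray();
    }
}
```

Whether silently substituting U+FFFD is acceptable depends on the caller; throwing, as the loop above does, is the stricter choice.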
An example with a surrogate pair 🌀 and a composed character ñ:
ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 E l N i n ◌̃ o
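For reference, the 0x1f300 in that result is built from the two code units \uD83C and \uDF00. A minimal sketch of the arithmetic follows; the helper name is hypothetical, and in practice Char.ConvertToUtf32(char, char) already does this.

```csharp
using System;

public static class SurrogateMath   // hypothetical helper, for illustration only
{
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // Each surrogate carries 10 bits; the combined value is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

SurrogateMath.Combine('\uD83C', '\uDF00') gives 0x1F300, the same value as Char.ConvertToUtf32('\uD83C', '\uDF00').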
Here's another example. These two code points represent a thirty-second musical note with an accent-staccato mark, both encoded as surrogate pairs:
ToCodePoints("\U0001D162\U0001D181"); // thirty-second note, combining accent-staccato
// { 0x1d162, 0x1d181 }
When normalized to Form C (the default for Normalize()), they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, all surrogate pairs:
ToCodePoints("\U0001D162\U0001D181".Normalize());
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 } // notehead black, combining stem, combining flag-3, combining accent-staccato
Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the first example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
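To make the difference concrete, here is a small sketch that mirrors the text-element approach (first code point of each text element) without the normalization step. The class name is hypothetical. The input "q\u0303" is used because, unlike n plus tilde, q plus combining tilde has no precomposed form, so normalizing first would not change the outcome.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSketch   // hypothetical name, for illustration only
{
    // Takes only the first code point of each text element,
    // so any combining marks that follow it are dropped.
    public static int[] FirstCodePointPerTextElement(string s)
    {
        var result = new List<int>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(char.ConvertToUtf32(e.GetTextElement(), 0));
        }
        return result.ToArray();
    }
}
```

For "q\u0303" this returns only { 0x71 }: the string is one text element, and the 0x303 is lost. A code point enumeration of the same string should contain both 0x71 and 0x303.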