Return code point of characters in C#
The following code writes the codepoints of a string
input to the console:
string input = "\uD834\uDD61";
for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
var codepoint = char.ConvertToUtf32(input, i);
Console.WriteLine("U+{0:X4}", codepoint);
}
Output:
U+1D161
Since strings in .NET are UTF-16 encoded, the char
values that make up the string need to be converted to UTF-32 first.
Easy, since chars in C# is actually UTF16 code points:
char x = 'A';
Console.WriteLine("U+{0:x4}", (int)x);
To address the comments, A char
in C# is a 16 bit number, and holds a UTF16 code point. Code points above 16 the bit space cannot be represented in a C# character. Characters in C# is not variable width. A string however can have 2 chars following each other, each being a code unit, forming a UTF16 code point. If you have a string input and characters above the 16 bit space, you can use char.IsSurrogatePair
and Char.ConvertToUtf32
, as suggested in another answer:
string input = ....
for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
{
int x = Char.ConvertToUtf32(input, i);
Console.WriteLine("U+{0:X4}", x);
}
C# cannot store unicode codepoints in a char
, as char
is only 2 bytes and unicode codepoints routinely exceed that length. The solution is to either represent a codepoint as a sequence of bytes (either as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF32, but that's not always ideal.
This is the code we use to split a string into its unicode codepoint components, but preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:
public class InvalidEncodingException : System.Exception
{ }
public static IEnumerable<string> UnicodeCodepoints(this string s)
{
for (int i = 0; i < s.Length; ++i)
{
if (Char.IsSurrogate(s[i]))
{
if (s.Length < i + 2)
{
throw new InvalidEncodingException();
}
yield return string.Format("{0}{1}", s[i], s[++i]);
}
else
{
yield return string.Format("{0}", s[i]);
}
}
}
}