Finding out Unicode character name in .Net

It's easier than ever now, as there's a NuGet package named Unicode Information.

With this, you can just call:

UnicodeInfo.GetName(character)
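A name lookup works per code point, while a .NET string is a sequence of UTF-16 code units, so text containing characters above U+FFFF has to be walked by code point first. The following is a minimal sketch of that walk using only the base class library. The class name CodePointNames and the method Describe are made up for illustration, and the lookup is passed in as a delegate so the sketch does not depend on any particular package.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointNames
{
    // Yields "U+XXXX name" for each code point in the text.
    // getName is any function that maps a code point to a name.
    public static IEnumerable<string> Describe(string text, Func<int, string> getName)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Combines a surrogate pair into one code point;
            // throws ArgumentException on an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
                i++; // the low surrogate was consumed as part of the pair
            yield return string.Format("U+{0:X4} {1}", codePoint, getName(codePoint));
        }
    }
}
```

With the package referenced, the call would presumably look like CodePointNames.Describe(text, UnicodeInfo.GetName), assuming GetName accepts an int code point; check the package documentation for the exact signature.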

Here's a solution you can implement immediately: copy, paste, and compile.

First, download the Unicode database (UCD) here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Next, add this code to your project to read the UCD and create a Dictionary for looking up the name of a .NET char value:

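// Requires: using System; using System.Collections.Generic; using System.Globalization; using System.IO; using System.Text;
string[] unicodedata = File.ReadAllLines( "UnicodeData.txt", Encoding.UTF8 );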
Dictionary<char,string> charname_map = new Dictionary<char,string>( 65536 );
for (int i = 0; i < unicodedata.Length; i++)
{
    string[] fields = unicodedata[i].Split( ';' );
    int char_code = int.Parse( fields[0], NumberStyles.HexNumber );
    string char_name = fields[1];
    if (char_code >= 0 && char_code <= 0xFFFF) //UTF-16 BMP code points only
    {
        bool is_range = char_name.EndsWith( ", First>" );
        if (is_range) //add all characters within a specified range
        {
            char_name = char_name.Replace( ", First", String.Empty ); //remove range indicator from name
            fields = unicodedata[++i].Split( ';' );
            int end_char_code = int.Parse( fields[0], NumberStyles.HexNumber );
            if (!fields[1].EndsWith( ", Last>" ))
                throw new Exception( "Expected end-of-range indicator." );
            for (int code_in_range = char_code; code_in_range <= end_char_code; code_in_range++)
                charname_map.Add( (char)code_in_range, char_name );
        }
        else
            charname_map.Add( (char)char_code, char_name );
    }
}

The UnicodeData.txt file is UTF-8 encoded and contains one line for each Unicode code point, except for large ranges, which are given as a pair of lines marking the first and last code point of the range. Each line is a semicolon-separated list of fields, where the first field is the Unicode code point in hexadecimal (with no prefix) and the second field is the character name. Information on the format of the UCD and on the other fields each line contains can be found here: http://www.unicode.org/reports/tr44/#Format_Conventions
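To make the line format concrete, here is a small sketch that splits a single UnicodeData.txt line into its first three fields: code point, name, and general category. The class name UcdLine and its Parse method are hypothetical helpers for illustration, not part of any library.

```csharp
using System;
using System.Globalization;

public static class UcdLine
{
    // Returns (code point, name, general category) from one UnicodeData.txt line.
    public static Tuple<int, string, string> Parse(string line)
    {
        string[] fields = line.Split(';');
        // Field 0 is the code point in hexadecimal, without a "0x" or "U+" prefix.
        int codePoint = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        // Field 1 is the character name, field 2 the general category (e.g. "Lu", "So").
        return Tuple.Create(codePoint, fields[1], fields[2]);
    }
}
```

For the line "00A9;COPYRIGHT SIGN;So;0;ON;;;;;N;;;;;", Parse yields the code point 0xA9, the name "COPYRIGHT SIGN" and the category "So".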

Once you've used the dictionary-building code above to fill charname_map, you can retrieve names from it with something like this:

char c = 'Â';
string character_name;
if (!charname_map.TryGetValue( c, out character_name ))
    character_name = "<Character Name Missing>"; //character not found in map
//character_name should now contain "LATIN CAPITAL LETTER A WITH CIRCUMFLEX";

I suggest embedding the UnicodeData.txt file in your application resources and wrapping this code in a class that loads and parses the file once in a static initializer. To make the code more readable, you could add an extension method on 'char' to that class, such as 'GetUnicodeName'.

I've purposely restricted the values to the range 0 through 0xFFFF, because that's all a .NET UTF-16 char can hold. A .NET char doesn't actually represent a true "character" (also called a code point), but rather a Unicode UTF-16 code unit, since some "characters" require two code units. Such a pair of code units is called a surrogate pair, made up of a high surrogate and a low surrogate. Values above 0xFFFF (the largest value a 16-bit char can store) are outside the Basic Multilingual Plane (BMP) and, under UTF-16, require two chars to encode. With this implementation, individual code units that are part of a surrogate pair end up with names like "Non Private Use High Surrogate", "Private Use High Surrogate", and "Low Surrogate".
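The wrapper class suggested above could be sketched as follows. This is one possible shape under stated assumptions, not a definitive implementation: the names UnicodeNames and Load are made up for illustration, and the sketch reads from any TextReader so that it stays self-contained (a static constructor could call Load on the embedded resource stream instead). Unlike the dictionary above, it keys on int code points and keeps ranges as (first, last, name) entries, so characters above 0xFFFF can be looked up too.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class UnicodeNames
{
    // Individually named code points, keyed by int so values above 0xFFFF fit.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>();
    // Ranges declared by a "<..., First>" / "<..., Last>" pair of lines.
    private static readonly List<Tuple<int, int, string>> Ranges = new List<Tuple<int, int, string>>();

    // Reads text in UnicodeData.txt format; intended to be called once at startup.
    public static void Load(TextReader reader)
    {
        string line;
        int rangeStart = -1;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;
            string[] fields = line.Split(';');
            int code = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            string name = fields[1];
            if (name.EndsWith(", First>", StringComparison.Ordinal))
                rangeStart = code; // remember the start; the next line closes the range
            else if (name.EndsWith(", Last>", StringComparison.Ordinal))
                Ranges.Add(Tuple.Create(rangeStart, code, name.Replace(", Last", string.Empty)));
            else
                Names[code] = name;
        }
    }

    // Returns the name for a code point, or null if it is not in the loaded data.
    public static string GetUnicodeName(int codePoint)
    {
        string name;
        if (Names.TryGetValue(codePoint, out name)) return name;
        foreach (var range in Ranges)
            if (codePoint >= range.Item1 && codePoint <= range.Item2) return range.Item3;
        return null;
    }

    // Extension method for a single UTF-16 code unit.
    public static string GetUnicodeName(this char c)
    {
        return GetUnicodeName((int)c);
    }

    // Extension method for the code point starting at index; combines a surrogate
    // pair into one code point and throws ArgumentException on an unpaired surrogate.
    public static string GetUnicodeName(this string s, int index)
    {
        return GetUnicodeName(char.ConvertToUtf32(s, index));
    }
}
```

After loading, 'A'.GetUnicodeName() looks up a single char, while text.GetUnicodeName(i) also handles a character that occupies two chars in the string.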

If you use Process Monitor to look at the files accessed by charmap.exe, you'll see that it opens a file named C:\Windows\system32\getuname.dll. This file contains the character names in its resources (actually the resources themselves are in a .mui file in a culture-specific subdirectory).

So all you have to do is get the names from this file, using the LoadString API. I wrote a helper class to do it:

using System;
using System.Runtime.InteropServices;
using System.Text;

public class Win32ResourceReader : IDisposable
{
    private IntPtr _hModule;

    public Win32ResourceReader(string filename)
    {
        _hModule = LoadLibraryEx(filename, IntPtr.Zero, LoadLibraryFlags.AsDataFile | LoadLibraryFlags.AsImageResource);
        if (_hModule == IntPtr.Zero)
            throw Marshal.GetExceptionForHR(Marshal.GetHRForLastWin32Error());
    }

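    public string GetString(uint id)
    {
        var buffer = new StringBuilder(1024);
        // LoadString returns the number of characters copied, or 0 on failure
        // (including when no string resource with this ID exists). Checking the
        // return value avoids acting on a stale last-error code.
        int length = LoadString(_hModule, id, buffer, buffer.Capacity);
        if (length == 0)
            throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
        return buffer.ToString();
    }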

    ~Win32ResourceReader()
    {
        Dispose(false);
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_hModule != IntPtr.Zero)
            FreeLibrary(_hModule);
        _hModule = IntPtr.Zero;
    }

    [DllImport("user32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern int LoadString(IntPtr hInstance, uint uID, StringBuilder lpBuffer, int nBufferMax);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr LoadLibraryEx(string lpFileName, IntPtr hReservedNull, LoadLibraryFlags dwFlags);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FreeLibrary(IntPtr hModule);

    [Flags]
    enum LoadLibraryFlags : uint
    {
        AsDataFile = 0x00000002,
        AsImageResource = 0x00000020
    }
}

You can use it like this:

string path = @"C:\Windows\System32\getuname.dll";
using (var reader = new Win32ResourceReader(path))
{
    string name = reader.GetString(0xA9);
    Console.WriteLine(name); // Copyright Sign
}
