Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI

Windows from NT4 onwards is based on Unicode-encoded strings, yes. Early versions were based on UCS-2, the fixed-width predecessor of UTF-16, which cannot represent characters outside the Basic Multilingual Plane (anything above U+FFFF). Windows 2000 and later are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, typically use UTF-8 instead.

UTF-8 is a very good choice for storing data persistently. It is supported in practically every Unicode environment, it can represent every Unicode character without loss, and it stays compact for mostly-ASCII text.

Yes, you would have to parse the XML, extract the necessary information from it, and decode and transform it into something the UI can use.


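std::wstring itself has no encoding: it is just a sequence of wchar_t values. On Windows, wchar_t is 16 bits wide, so a std::wstring holds 16-bit code units, and the Windows APIs interpret them as UTF-16 (on most *nix systems wchar_t is 32 bits instead). It's important to understand that UCS-2 is not the same as UTF-16! UTF-16 allows "surrogate pairs" (two 16-bit units) in order to represent characters outside the two-byte range, but UCS-2 uses exactly two bytes for each character, period, and cannot represent those characters at all.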
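To make the surrogate-pair mechanism concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The helper name encode_utf16 is made up for illustration, and it uses char16_t/std::u16string so that it behaves the same on every platform (on Windows, wchar_t has the same 16-bit width).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    // Only valid scalar values: at most U+10FFFF and not a surrogate itself.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Fits in one 16-bit unit; this is the only case UCS-2 can express.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the remaining 20 bits across a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For example, U+0041 ('A') comes out as one unit, while U+1F600 comes out as the two units 0xD83D 0xDE00.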

The best rule for your situation is to do your transcoding when you read from and write to the disk. Once the text is in memory, keep it in a std::wstring as 16-bit code units. Windows APIs will read it as UTF-16: std::wstring has no concept of surrogate pairs and simply stores each half as a separate element, but if your string contains them (which it won't, if your only language is English), Windows will interpret them correctly.
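As a rough sketch of the "transcode on read" half of that rule, the function below decodes UTF-8 bytes into UTF-16 code units. On Windows the usual route is the MultiByteToWideChar API with the CP_UTF8 code page; this hand-written version is only a portable stand-in, the name utf8_to_utf16 is hypothetical, and it uses std::u16string in place of Windows' 16-bit std::wstring.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // emit a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    assert(i == in.size());  // every input byte was consumed
    return out;
}
```

You would call this once on the bytes you read from the file (or on each string the XML parser hands you, if it returns UTF-8), and pass the result around from then on.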

Whenever you're reading data from or writing data to serialization formats (such as XML) in the modern day, you'll probably need to do transcoding. It's an unpleasant and very unfortunate fact of life, but it's inevitable: the common Unicode encodings (UTF-8 and UTF-16) are variable-width, while most character-based operations in C++ treat a string as an array, which only maps cleanly onto characters when every element has the same width.
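To illustrate why array-style indexing and "characters" can disagree, here is a small sketch that counts code points in a UTF-16 string and compares that with the element count. The function name count_code_points is invented for this example, and it assumes the input is well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in well-formed UTF-16 text.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        // A low surrogate is the second half of a pair, so it does not
        // start a new code point; every other unit does.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    assert(n <= s.size());  // there can never be more code points than units
    return n;
}
```

For a string like u"a\U0001F600b", size() reports 4 elements but the function reports 3 code points, so s[2] is half of a character rather than 'b'.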

Higher-level frameworks, such as .NET, hide most of the details, but behind the scenes they handle transcoding in much the same fashion: decoding the external encoding into an in-memory string of 16-bit code units (.NET strings are UTF-16), manipulating that, and then encoding it again when required for output.


AFAIK when you work with std::wstring on Windows in C++ and store your files as UTF-8 (which sounds good and reasonable), you have to convert the data from UTF-16 to UTF-8 when writing to a file, and convert it back to UTF-16 when reading from a file. Check out this link: Writing UTF-8 Files in C++.
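The write direction could look roughly like the sketch below. It is not the code from the linked article; utf16_to_utf8 and write_utf8_file are hypothetical names, std::u16string stands in for Windows' 16-bit std::wstring, and on Windows you could use the WideCharToMultiByte API with CP_UTF8 for the conversion step instead.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        assert(cp <= 0x10FFFF);

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: write text to a file as UTF-8, without a BOM.
// Binary mode keeps the stream from translating line endings on Windows.
void write_utf8_file(const std::string& path, const std::u16string& text) {
    std::ofstream file(path, std::ios::binary);
    if (!file) throw std::runtime_error("cannot open file for writing");
    const std::string bytes = utf16_to_utf8(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    if (!file) throw std::runtime_error("write failed");
}
```

Reading is the mirror image: open the file in binary mode, read the bytes into a std::string, and decode them to 16-bit units before handing them to the UI.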

I would stick with the Visual Studio default of project -> Properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set, use the wchar_t type (i.e. with std::wstring) and not use the TCHAR type. (E.g. I would just use the wcslen version of strlen and not _tcslen.)
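A tiny sketch of what "use wchar_t directly" looks like in practice, with wide literals and the wcs* functions instead of the TCHAR macros. The function names wide_length and make_greeting are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: length of a wide C string in wchar_t units,
// calling wcslen directly rather than going through _tcslen.
std::size_t wide_length(const wchar_t* text) {
    assert(text != nullptr);
    return std::wcslen(text);
}

// Hypothetical helper: build a wide string from an L"..." literal,
// with no TCHAR or _T() macros involved.
std::wstring make_greeting(const std::wstring& name) {
    return L"Hello, " + name;
}
```

Note that wcslen counts wchar_t units, not characters: a character outside the BMP counts as 2 where wchar_t is 16 bits (Windows) and as 1 where it is 32 bits.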