char vs wchar_t when to use which data type
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.
The issue is that many languages require more encodings than ASCII accounts for. So, instead of the 128 possible encodings of ASCII, more are needed. Some languages have more than 256 possible encodings. A char type does not guarantee a range greater than 256. Thus a new data type is required.
The wchar_t type, a.k.a. the wide character type, provides more room for encodings.
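To make the size difference concrete, here is a minimal sketch that reports the width of both types on the current platform. The helper names char_width_bits and wchar_width_bits are made up for illustration. The width of wchar_t is implementation-defined: it is commonly 16 bits on Windows and 32 bits on Linux.

```cpp
#include <climits>
#include <cstddef>

// Width of char in bits. The standard guarantees at least 8.
constexpr std::size_t char_width_bits() {
    return CHAR_BIT * sizeof(char);
}

// Width of wchar_t in bits. This is implementation-defined, so code that
// assumes a particular width is not portable.
constexpr std::size_t wchar_width_bits() {
    return CHAR_BIT * sizeof(wchar_t);
}

static_assert(char_width_bits() >= 8, "char must hold at least 256 values");
```

Printing both values on Windows and on Linux is a quick way to see why the same wchar_t code can behave differently across platforms.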
Summary
Use the char data type when the range of encodings is 256 or less, such as ASCII. Use wchar_t when you need the capacity for more than 256.
Prefer Unicode to handle large character sets (such as emojis).
Never use wchar_t.
When possible, use (some kind of array of) char, such as std::string, and ensure that it is encoded in UTF-8.
When you must interface with APIs that don't speak UTF-8, use char16_t or char32_t. Never use them otherwise; they provide only illusory advantages and encourage faulty code.
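As an illustration of converting only at the API boundary, here is a rough sketch that turns a UTF-8 std::string into UTF-16 stored in char16_t. The function name utf8_to_utf16 is hypothetical, and the validation is deliberately minimal: it rejects truncated or malformed sequences but does not reject overlong encodings. A production program would more likely rely on a tested library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16 for an API boundary.
// Throws std::invalid_argument on truncated or malformed input.
// Assumption: overlong encodings and UTF-8-encoded surrogates are not rejected.
inline std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note that U+1F600 (an emoji) comes out as two char16_t units, so UTF-16 is a variable-width encoding too.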
Note that there are plenty of cases where more than one char32_t is required to represent a single user-visible character. OTOH, using UTF-8 with char forces you to handle variable width very early.
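A small sketch of what "variable width" means in practice: the byte length of a UTF-8 std::string and its number of code points differ as soon as non-ASCII text appears. The helper count_code_points is a made-up name and assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, "na\xC3\xAFve" occupies 6 bytes but holds 5 code points. The string "e\xCC\x81" (the letter e followed by the combining acute accent U+0301) occupies 3 bytes and 2 code points, yet it is displayed as one character, so counting code points or char32_t values still does not count user-visible characters.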
Short answer:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).
Long answer:
The design of the standard C++ library implies there is only one way to handle Unicode: storing UTF-8 encoded strings in char arrays, as almost all functions exist only in char variants (think of std::exception::what).
In a C++ program you have two locales:
- The standard C library locale, set by std::setlocale
- The standard C++ library locale, set by std::locale::global
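A minimal sketch of setting both locales from the user's environment follows. The struct LocaleNames and the function use_environment_locale are hypothetical names, and the fallback to the classic "C" locale is just one possible policy. Note that std::locale("") can throw std::runtime_error when the environment names a locale that is not installed.

```cpp
#include <clocale>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical result type: the names of the locales that ended up active.
struct LocaleNames {
    std::string c_locale;
    std::string cpp_locale;
};

// Switches both the C and the C++ global locale to the user's environment
// locale, falling back to the current or classic locale on failure.
inline LocaleNames use_environment_locale() {
    LocaleNames names;

    // C library locale: an empty name means "take it from the environment".
    if (const char* name = std::setlocale(LC_ALL, "")) {
        names.c_locale = name;
    } else if (const char* current = std::setlocale(LC_ALL, nullptr)) {
        names.c_locale = current;  // request failed, report what is active
    }

    // C++ library locale: std::locale("") may throw if the name is unknown.
    try {
        std::locale::global(std::locale(""));
    } catch (const std::runtime_error&) {
        std::locale::global(std::locale::classic());
    }
    names.cpp_locale = std::locale().name();
    return names;
}
```

On a typical Linux desktop this would report something like en_US.UTF-8 for both, but the exact names depend entirely on the environment.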
Unfortunately, neither of them defines the behavior of the standard functions that open files (like std::fopen, std::fstream::open, etc.). Behavior differs between OSes:
- Linux is encoding agnostic, so those functions simply pass the char string to the underlying system call.
- On Windows, the char string is converted to a wide string using the user-specific locale before the system call is made.
Everything usually works fine on Linux, as everyone uses UTF-8 based locales, so all user input and the arguments passed to main will be UTF-8 encoded. But you might still need to switch the current locales to UTF-8 variants explicitly, because by default a C++ program starts in the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assuming they hold UTF-8 sequences, and everything "just works".
Problems appear when you want to support Windows, as there you always have an additional third locale: the one set for the current user, which can be configured somewhere in "Control Panel". The main issue is that this locale is never a Unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you will have to use custom wrappers that use non-standard, Windows-specific functions like _wfopen and std::fstream::open(const wchar_t *). You can check Boost.Nowide (a standalone library when this was written; it has been part of Boost since version 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
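The kind of wrapper described above might look roughly like this. It is a sketch, not how Boost.Nowide is actually implemented: the name fopen_utf8 is made up, and error handling is reduced to returning nullptr. On Windows it widens the UTF-8 arguments with MultiByteToWideChar and calls _wfopen; elsewhere it forwards to std::fopen unchanged.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: opens a file whose path and mode are UTF-8 strings.
// Returns nullptr on failure, like std::fopen.
inline std::FILE* fopen_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    // Convert a UTF-8 string to UTF-16; returns an empty string on failure.
    auto widen = [](const std::string& s) -> std::wstring {
        if (s.empty()) return std::wstring();
        const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            s.data(), static_cast<int>(s.size()),
                                            nullptr, 0);
        if (len <= 0) return std::wstring();
        std::wstring wide(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            s.data(), static_cast<int>(s.size()),
                            &wide[0], len);
        return wide;
    };
    const std::wstring wpath = widen(path);
    const std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // POSIX systems pass the bytes through to the kernel untouched.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

Callers then keep every path as UTF-8 in std::string and never touch wchar_t outside this one function.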
With C++17 you can use std::filesystem::path to store a file path in a portable way, but it is still broken on Windows:
- The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string, but it is too easy to forget about this and use the implicit constructor instead.
- std::error_category::message(int) for both error categories returns the error description using the user-specific encoding.
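One way to avoid the implicit constructor is to route every conversion through a pair of small helpers. The names path_from_utf8 and path_to_utf8 below are hypothetical. The sketch uses std::filesystem::u8path in C++17 and switches to the char8_t-based constructor when the library reports C++20 char8_t support, since u8path is deprecated there.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: builds a path from a UTF-8 encoded std::string
// without going through the user-specific locale.
inline std::filesystem::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_lib_char8_t)
    // C++20: char8_t sources are always interpreted as UTF-8.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path interprets its argument as UTF-8.
    return std::filesystem::u8path(utf8);
#endif
}

// Hypothetical helper: returns the path as a UTF-8 encoded std::string.
inline std::string path_to_utf8(const std::filesystem::path& p) {
    // u8string() returns std::string in C++17 and std::u8string in C++20,
    // so copy the bytes into a std::string to cover both.
    const auto encoded = p.u8string();
    return std::string(encoded.begin(), encoded.end());
}
```

With these in place, code such as path_from_utf8(name).filename() never depends on the user's locale.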
So what we have on Windows is:
- Standard library functions that open files are broken and should never be used.
- Arguments passed to main(int, char**) are broken and should never be used.
- WinAPI functions ending with *A and macros are broken and should never be used.
- std::filesystem::path is partially broken and should never be used directly.
- Error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need long term solution for a non-trivial project, I would recommend:
- Using Boost.Nowide or implementing similar functionality directly - this fixes the broken standard library.
- Re-implementing the standard error categories returned by std::generic_category and std::system_category so that they always return UTF-8 encoded strings.
- Wrapping std::filesystem::path so that the new class always uses UTF-8 when converting a path to a string and a string to a path.
- Wrapping all required functions from std::filesystem so that they use your path wrapper and your error categories.
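To give an idea of what re-implementing an error category involves, here is a deliberately tiny sketch. The class Utf8GenericCategory and the function utf8_generic_category are hypothetical names, and only three errno values are covered. A real replacement would cover every value your program can encounter and, for system errors on Windows, would convert the OS message to UTF-8.

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical category whose messages are fixed UTF-8 (here plain ASCII)
// strings, independent of the user's locale.
class Utf8GenericCategory final : public std::error_category {
public:
    const char* name() const noexcept override { return "utf8_generic"; }

    std::string message(int condition) const override {
        switch (condition) {
            case ENOENT: return "No such file or directory";
            case EACCES: return "Permission denied";
            case EEXIST: return "File exists";
            default:     return "Unknown error " + std::to_string(condition);
        }
    }
};

// Returns the single shared instance, mirroring std::generic_category().
inline const std::error_category& utf8_generic_category() {
    static const Utf8GenericCategory instance;
    return instance;
}
```

An error is then created as std::error_code(ENOENT, utf8_generic_category()), and its message() is safe to log or display as UTF-8.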
Unfortunately, this won't fix issues with other libraries that work with files, but many of them are broken anyway (they do not support Unicode).
You can check this link for further explanation: http://utf8everywhere.org/