C++20 with u8, char8_t and std::string
In addition to @lubgr's answer, the paper char8_t backward compatibility remediation (P1423) discusses several ways to construct a std::string from char8_t character arrays.
Basically, the idea is that you can cast the u8 char array to a "normal" char array to get the same behaviour as in C++17 and before; you just have to be a bit more explicit. The paper discusses various ways to do this.
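As a rough illustration of the cast idea, here is a minimal sketch. The helper names `as_char` and `bullet` are made up for this example and are not taken from P1423; the feature-test macro `__cpp_char8_t` is used so the same code also builds before C++20, where u8 literals are already plain char.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: pre-C++20, u8 literals are already const char arrays.
inline const char* as_char(const char* s) { return s; }

#if defined(__cpp_char8_t)
// C++20: view the char8_t code units as plain char.
// Reading any object through a char pointer is permitted by the aliasing rules.
inline const char* as_char(const char8_t* s) {
    return reinterpret_cast<const char*>(s);
}
#endif

// Builds a std::string holding the UTF-8 encoding of U+25CF (three bytes).
inline std::string bullet() {
    return std::string(as_char(u8"\u25CF"));
}
```

The cast itself costs nothing at run time; the only explicit step is the call to the helper at each place where a u8 literal meets a char-based API.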
The simplest method that fits your use case (though not fully zero-overhead unless you add more overloads) is probably the last one, i.e. introducing explicit conversion functions:
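#include <string>
#include <utility>

// C++17 and earlier: u8 literals are plain char, so strings pass through unchanged.
std::string from_u8string(const std::string &s) {
    return s;
}
std::string from_u8string(std::string &&s) {
    return std::move(s);
}
#if defined(__cpp_lib_char8_t)
// C++20: copy the char8_t code units into a std::string.
std::string from_u8string(const std::u8string &s) {
    return std::string(s.begin(), s.end());
}
#endif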
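A possible call site might look like the sketch below. The conversion functions are repeated (in compact form) only so that the example compiles on its own; `make_label` is a hypothetical caller invented for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

std::string from_u8string(const std::string &s) { return s; }
std::string from_u8string(std::string &&s) { return std::move(s); }
#if defined(__cpp_lib_char8_t)
std::string from_u8string(const std::u8string &s) {
    return std::string(s.begin(), s.end());
}
#endif

// Hypothetical caller: the same line works before and after C++20.
// C++17: the literal becomes a temporary std::string, which is moved.
// C++20: the literal becomes a temporary std::u8string, which is copied
//        code unit by code unit (this is the overhead mentioned above).
std::string make_label() {
    return from_u8string(u8"\u25CF ok");
}
```

The extra temporary in the C++20 path is where the "not fully zero overhead" caveat comes from; an additional overload taking the character array directly would presumably avoid it.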
Should I be creating a new utf8string?
No, it's already there. P0482 does not only propose char8_t, but also a new specialization of std::basic_string for char8_t character types named std::u8string. So this already compiles with clang and libc++ from trunk:
const std::u8string str = u8"●";
The fact that std::string construction from a u8-literal breaks is unfortunate. From the proposal:
This proposal does not specify any backward compatibility features other than to retain interfaces that it deprecates. The author believes such features are necessary, but that a single set of such features would unnecessarily compromise the goals of this proposal. Rather, the expectation is that implementations will provide options to enable more fine grained compatibility features.
But I guess most initializations like the one above should be grep-able or subject to automatic fixes by clang tooling.
Should I be creating a new utf8string?
No, C++20 adds std::u8string. However, I would recommend using std::string instead, because char8_t is poorly supported in the standard and not supported by any system APIs at all (and likely never will be, for compatibility reasons). On most platforms normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which will give you portable Unicode support on major operating systems.
For example, you cannot even write a Hello World program using u8 strings in C++20 (https://godbolt.org/z/E6rvj5):
std::cout << u8"Hello, world!\n"; // won't compile in C++20
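If a u8 string has to be written to a char stream anyway, one workaround is to hand its bytes to the stream explicitly. The following is only a sketch under that assumption; `write_utf8` is a hypothetical helper name, and the stream simply receives the raw UTF-8 code units without any transcoding.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string_view>

// Pre-C++20 (or plain char data): write the bytes as they are.
inline void write_utf8(std::ostream &os, std::string_view text) {
    os.write(text.data(), static_cast<std::streamsize>(text.size()));
}

#if defined(__cpp_lib_char8_t)
// C++20: reinterpret the char8_t code units as char and write them unchanged.
inline void write_utf8(std::ostream &os, std::u8string_view text) {
    os.write(reinterpret_cast<const char *>(text.data()),
             static_cast<std::streamsize>(text.size()));
}
#endif
```

A call such as `write_utf8(std::cout, u8"Hello, world!\n");` then compiles in both language modes. Whether the console displays the bytes correctly still depends on the platform and its code page.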
On Windows with MSVC and pre-C++20 the situation is even worse because u8 strings may be silently corrupted. For example:
std::cout << "Привет, мир!\n";
will produce valid UTF-8 that may or may not be displayed in the console depending on its current code page while
std::cout << u8"Привет, мир!\n";
will almost definitely give you an invalid result such as "╨а╤Я╨б╨В╨а╤С╨а╨Ж╨а┬╡╨бтАЪ, ╨а╤Ш╨а╤С╨б╨В!".