Is there a flaw in how clang implements char8_t or does some dark corner of the standard prohibit optimization?
This is not a "bug" in Clang; merely a missed opportunity for optimization.
You can replicate the Clang compiler output by using the same function taking an enum class whose underlying type is unsigned char. By contrast, GCC recognizes a difference between an enumeration with an underlying type of unsigned char and char8_t. It emits the same code for unsigned char and char8_t, but emits more complex code for the enum class case.
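To make the comparison concrete, here is a minimal sketch of the kind of experiment described above. The function from the question is not reproduced here, so `starts_with_abcd`, the `Byte` enumeration, and the four-byte pattern are hypothetical stand-ins; the `char8_t` overload is only compiled when the compiler advertises `__cpp_char8_t`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical enumeration with the same underlying type as char8_t.
enum class Byte : unsigned char {};

// Hypothetical stand-in for the function in the question: compare the
// first four elements of a buffer against a fixed pattern with std::equal.
template <typename T>
bool starts_with_abcd(const T* p) {
    static constexpr T pattern[] = {T(0x61), T(0x62), T(0x63), T(0x64)};
    return std::equal(pattern, pattern + 4, p);
}

// One non-template entry point per element type, so the generated
// assembly for each can be inspected side by side.
bool check_uchar(const unsigned char* p) { return starts_with_abcd(p); }
bool check_enum(const Byte* p) { return starts_with_abcd(p); }
#ifdef __cpp_char8_t
bool check_char8(const char8_t* p) { return starts_with_abcd(p); }
#endif

// Small usage example: all variants agree on the result; only the
// emitted code is expected to differ between compilers.
inline void example_usage() {
    const unsigned char text[] = {0x61, 0x62, 0x63, 0x64};
    assert(check_uchar(text));
}
```

Compiling each `check_*` function separately (for example on godbolt.org) is one way to see which variants a given compiler treats alike.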
So Clang's implementation of char8_t seems to treat it more like a user-defined enumeration than like a fundamental type. It's best to just consider it an early implementation of the standard.
It should be noted that one of the most important differences between unsigned char and char8_t is aliasing requirements. unsigned char pointers may alias with pretty much anything else. By contrast, char8_t pointers cannot. As such, it is reasonable to expect (on a mature implementation, not something that beats the standard it implements to market) different code to be emitted in different cases. The trick is that char8_t code ought to be more efficient if it's different, since the compiler no longer has to emit code that performs additional work to deal with potential aliasing from stores.
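The aliasing point can be illustrated with a small sketch. The function names `zero_uchar` and `zero_char8` are hypothetical and do not come from the question. In the unsigned char version, every store through `dst` could in principle modify `*count`, so the compiler has to assume the loop bound may change between iterations; in the char8_t version it is allowed to assume it does not. Whether a particular compiler actually exploits this is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>

// unsigned char* may alias any object, including *count, so a store to
// dst[i] might change the loop bound and *count may need to be reloaded.
void zero_uchar(unsigned char* dst, const std::size_t* count) {
    for (std::size_t i = 0; i < *count; ++i) dst[i] = 0;
}

#ifdef __cpp_char8_t
// char8_t* has no such aliasing permission, so the compiler is free to
// read *count once before the loop. Only compiled when char8_t exists.
void zero_char8(char8_t* dst, const std::size_t* count) {
    for (std::size_t i = 0; i < *count; ++i) dst[i] = 0;
}
#endif

// Small usage example: only the first `n` bytes are cleared.
inline void example_zero() {
    unsigned char buf[4] = {1, 2, 3, 4};
    const std::size_t n = 3;
    zero_uchar(buf, &n);
    assert(buf[0] == 0 && buf[3] == 4);
}
```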
- In libstdc++, std::equal calls __builtin_memcmp when it detects that the arguments are "simple", otherwise it uses a naive for loop. "Simple" here means pointers (or certain iterator wrappers around pointers) to the same integer or pointer type. (relevant source code)
  - Whether a type is an integer type is detected by the internal __is_integer trait, but libstdc++ 8.2.0 (the version used on godbolt.org) does not specialize this trait for char8_t, so the latter is not detected as an integer type. (relevant source code)
- Clang (with this particular configuration) generates more verbose assembly in the for loop case than in the __builtin_memcmp case. (But the former is not necessarily less optimized in terms of performance. See Loop_unrolling.)
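The dispatch described in the list above can be sketched roughly as follows. This is not the libstdc++ source: `is_known_integer`, `equal_impl`, `sketch_equal`, and `Unlisted` are hypothetical names, and portable `std::memcmp` is used where libstdc++ uses `__builtin_memcmp`. The trait is deliberately left unspecialized for `Unlisted` (which is `char8_t` when available, otherwise a stand-in enumeration) to mimic the missing `__is_integer` specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Element type the trait does not know about: char8_t if the compiler
// provides it, otherwise a stand-in with the same underlying type.
#ifdef __cpp_char8_t
using Unlisted = char8_t;
#else
enum class Unlisted : unsigned char {};
#endif

// Hypothetical trait playing the role of the internal integer check.
template <typename T> struct is_known_integer { static constexpr bool value = false; };
template <> struct is_known_integer<unsigned char> { static constexpr bool value = true; };
// Note: no specialization for Unlisted.

// Generic path: element-by-element loop.
template <bool Simple> struct equal_impl {
    template <typename T>
    static bool run(const T* first, const T* last, const T* other) {
        for (; first != last; ++first, ++other)
            if (!(*first == *other)) return false;
        return true;
    }
};

// "Simple" path: a single memcmp over the whole range.
template <> struct equal_impl<true> {
    template <typename T>
    static bool run(const T* first, const T* last, const T* other) {
        const std::size_t len = static_cast<std::size_t>(last - first) * sizeof(T);
        return len == 0 || std::memcmp(first, other, len) == 0;
    }
};

// Pick the path at compile time based on the trait.
template <typename T>
bool sketch_equal(const T* first, const T* last, const T* other) {
    return equal_impl<is_known_integer<T>::value>::run(first, last, other);
}

// unsigned char takes the memcmp path; Unlisted falls back to the loop.
static_assert(is_known_integer<unsigned char>::value, "memcmp path");
static_assert(!is_known_integer<Unlisted>::value, "loop path");
```

Both paths return the same result; the difference the answer points to is only in which code the compiler ends up optimizing.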
So there's a reason for this difference, and it's not a bug in clang IMO.