Convert a String In C++ To Upper Case

#include <algorithm>
#include <string>

std::string str = "Hello World";
std::transform(str.begin(), str.end(),str.begin(), ::toupper);

Boost string algorithms:

#include <boost/algorithm/string.hpp>
#include <string>

std::string str = "Hello World";

boost::to_upper(str);

std::string newstr = boost::to_upper_copy<std::string>("Hello World");

Short solution using C++11 and toupper().

for (auto & c: str) c = toupper(c);

This problem is vectorizable with SIMD for the ASCII character set.


Speedup comparisons:

Preliminary testing with x86-64 gcc 5.2 -O3 -march=native on a Core2Duo (Merom). The same string of 120 characters (mixed lowercase and non-lowercase ASCII), converted in a loop 40M times (with no cross-file inlining, so the compiler can't optimize away or hoist any of it out of the loop). Same source and dest buffers, so no malloc overhead or memory/cache effects: data is hot in L1 cache the whole time, and we're purely CPU-bound.

  • boost::to_upper_copy<char*, std::string>(): 198.0s. Yes, Boost 1.58 on Ubuntu 15.10 is really this slow. I profiled and single-stepped the asm in a debugger, and it's really, really bad: there's a dynamic_cast of a locale variable happening per character!!! (dynamic_cast takes multiple calls to strcmp). This happens with LANG=C and with LANG=en_CA.UTF-8.

    I didn't test using a RangeT other than std::string. Maybe the other form of to_upper_copy optimizes better, but I think it will always new/malloc space for the copy, so it's harder to test. Maybe something I did differs from a normal use-case, and maybe normally stopped g++ can hoist the locale setup stuff out of the per-character loop. My loop reading from a std::string and writing to a char dstbuf[4096] makes sense for testing.

  • loop calling glibc toupper: 6.67s (not checking the int result for potential multi-byte UTF-8, though. This matters for Turkish.)

  • ASCII-only loop: 8.79s (my baseline version for the results below.) Apparently a table-lookup is faster than a cmov, with the table hot in L1 anyway.
  • ASCII-only auto-vectorized: 2.51s. (120 chars is half way between worst case and best case, see below)
  • ASCII-only manually vectorized: 1.35s

See also this question about toupper() being slow on Windows when a locale is set.


I was shocked that Boost is an order of magnitude slower than the other options. I double-checked that I had -O3 enabled, and even single-stepped the asm to see what it was doing. It's almost exactly the same speed with clang++ 3.8. It has huge overhead inside the per-character loop. The perf record / report result (for the cycles perf event) is:

  32.87%  flipcase-clang-  libstdc++.so.6.0.21   [.] _ZNK10__cxxabiv121__vmi_class_type_info12__do_dyncastElNS_17__class_type_info10__sub_kindEPKS1_PKvS4_S6_RNS1_16
  21.90%  flipcase-clang-  libstdc++.so.6.0.21   [.] __dynamic_cast                                                                                                 
  16.06%  flipcase-clang-  libc-2.21.so          [.] __GI___strcmp_ssse3                                                                                            
   8.16%  flipcase-clang-  libstdc++.so.6.0.21   [.] _ZSt9use_facetISt5ctypeIcEERKT_RKSt6locale                                                                     
   7.84%  flipcase-clang-  flipcase-clang-boost  [.] _Z16strtoupper_boostPcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE                                   
   2.20%  flipcase-clang-  libstdc++.so.6.0.21   [.] strcmp@plt                                                                                                     
   2.15%  flipcase-clang-  libstdc++.so.6.0.21   [.] __dynamic_cast@plt                                                                                             
   2.14%  flipcase-clang-  libstdc++.so.6.0.21   [.] _ZNKSt6locale2id5_M_idEv                                                                                       
   2.11%  flipcase-clang-  libstdc++.so.6.0.21   [.] _ZNKSt6locale2id5_M_idEv@plt                                                                                   
   2.08%  flipcase-clang-  libstdc++.so.6.0.21   [.] _ZNKSt5ctypeIcE10do_toupperEc                                                                                  
   2.03%  flipcase-clang-  flipcase-clang-boost  [.] _ZSt9use_facetISt5ctypeIcEERKT_RKSt6locale@plt                                                                 
   0.08% ...

Autovectorization

Gcc and clang will only auto-vectorize loops when the iteration count is known ahead of the loop. (i.e. search loops like plain-C implementation of strlen won't autovectorize.)

Thus, for strings small enough to fit in cache, we get a significant speedup for strings ~128 chars long from doing strlen first. This won't be necessary for explicit-length strings (like C++ std::string).

// char, not int, is essential: otherwise gcc unpacks to vectors of int!  Huge slowdown.
char ascii_toupper_char(char c) {
    return ('a' <= c && c <= 'z') ? c^0x20 : c;    // ^ autovectorizes to PXOR: runs on more ports than paddb
}

// gcc can only auto-vectorize loops when the number of iterations is known before the first iteration.  strlen gives us that
size_t strtoupper_autovec(char *dst, const char *src) {
    size_t len = strlen(src);
    for (size_t i=0 ; i<len ; ++i) {
        dst[i] = ascii_toupper_char(src[i]);  // gcc does the vector range check with psubusb / pcmpeqb instead of pcmpgtb
    }
    return len;
}

Any decent libc will have an efficient strlen that's much faster than looping a byte at a time, so separate vectorized strlen and toupper loops are faster.

Baseline: a loop that checks for a terminating 0 on the fly.

Times for 40M iterations, on a Core2 (Merom) 2.4GHz. gcc 5.2 -O3 -march=native. (Ubuntu 15.10). dst != src (so we make a copy), but they don't overlap (and aren't nearby). Both are aligned.

  • 15 char string: baseline: 1.08s. autovec: 1.34s
  • 16 char string: baseline: 1.16s. autovec: 1.52s
  • 127 char string: baseline: 8.91s. autovec: 2.98s // non-vector cleanup has 15 chars to process
  • 128 char string: baseline: 9.00s. autovec: 2.06s
  • 129 char string: baseline: 9.04s. autovec: 2.07s // non-vector cleanup has 1 char to process

Some results are a bit different with clang.

The microbenchmark loop that calls the function is in a separate file. Otherwise it inlines and strlen() gets hoisted out of the loop, and it runs dramatically faster, esp. for 16 char strings (0.187s).

This has the major advantage that gcc can auto-vectorize it for any architecture, but the major disadvantage that it's slower for the usually-common case of small strings.


So there are big speedups, but compiler auto-vectorization doesn't make great code, esp. for cleanup of the last up-to-15 characters.

Manual vectorization with SSE intrinsics:

Based on my case-flip function that inverts the case of every alphabetic character. It takes advantage of the "unsigned compare trick", where you can do low < a && a <= high with a single unsigned comparison by range shifting, so that any value less than low wraps to a value that's greater than high. (This works if low and high aren't too far apart.)

SSE only has a signed compare-greater, but we can still use the "unsigned compare" trick by range-shifting to the bottom of the signed range: Subtract 'a'+128, so the alphabetic characters range from -128 to -128+25 (-128+'z'-'a')

Note that adding 128 and subtracting 128 are the same thing for 8bit integers. There's nowhere for the carry to go, so it's just xor (carryless add), flipping the high bit.

#include <immintrin.h>

__m128i upcase_si128(__m128i src) {
    // The above 2 paragraphs were comments here
    __m128i rangeshift = _mm_sub_epi8(src, _mm_set1_epi8('a'+128));
    __m128i nomodify   = _mm_cmpgt_epi8(rangeshift, _mm_set1_epi8(-128 + 25));  // 0:lower case   -1:anything else (upper case or non-alphabetic).  25 = 'z' - 'a'

    __m128i flip  = _mm_andnot_si128(nomodify, _mm_set1_epi8(0x20));            // 0x20:lcase    0:non-lcase

    // just mask the XOR-mask so elements are XORed with 0 instead of 0x20
    return          _mm_xor_si128(src, flip);
    // it's easier to xor with 0x20 or 0 than to AND with ~0x20 or 0xFF
}

Given this function that works for one vector, we can call it in a loop to process a whole string. Since we're already targeting SSE2, we can do a vectorized end-of-string check at the same time.

We can also do much better for the "cleanup" of the last up-to-15 bytes left over after doing vectors of 16B: upper-casing is idempotent, so re-processing some input bytes is fine. We do an unaligned load of the last 16B of the source, and store it into the dest buffer overlapping the last 16B store from the loop.

The only time this doesn't work is when the whole string is under 16B: Even when dst=src, non-atomic read-modify-write is not the same thing as not touching some bytes at all, and can break multithreaded code.

We have a scalar loop for that, and also to get src aligned. Since we don't know where the terminating 0 will be, an unaligned load from src might cross into the next page and segfault. If we need any bytes in an aligned 16B chunk, it's always safe to load the whole aligned 16B chunk.

Full source: in a github gist.

// FIXME: doesn't always copy the terminating 0.
// microbenchmarks are for this version of the code (with _mm_store in the loop, instead of storeu, for Merom).
size_t strtoupper_sse2(char *dst, const char *src_begin) {
    const char *src = src_begin;
    // scalar until the src pointer is aligned
    while ( (0xf & (uintptr_t)src) && *src ) {
        *(dst++) = ascii_toupper(*(src++));
    }

    if (!*src)
        return src - src_begin;

    // current position (p) is now 16B-aligned, and we're not at the end
    int zero_positions;
    do {
        __m128i sv = _mm_load_si128( (const __m128i*)src );
        // TODO: SSE4.2 PCMPISTRI or PCMPISTRM version to combine the lower-case and '\0' detection?

        __m128i nullcheck = _mm_cmpeq_epi8(_mm_setzero_si128(), sv);
        zero_positions = _mm_movemask_epi8(nullcheck);
        // TODO: unroll so the null-byte check takes less overhead
        if (zero_positions)
            break;

        __m128i upcased = upcase_si128(sv);   // doing this before the loop break lets gcc realize that the constants are still in registers for the unaligned cleanup version.  But it leads to more wasted insns in the early-out case

        _mm_storeu_si128((__m128i*)dst, upcased);
        //_mm_store_si128((__m128i*)dst, upcased);  // for testing on CPUs where storeu is slow
        src += 16;
        dst += 16;
    } while(1);

    // handle the last few bytes.  Options: scalar loop, masked store, or unaligned 16B.
    // rewriting some bytes beyond the end of the string would be easy,
    // but doing a non-atomic read-modify-write outside of the string is not safe.
    // Upcasing is idempotent, so unaligned potentially-overlapping is a good option.

    unsigned int cleanup_bytes = ffs(zero_positions) - 1;  // excluding the trailing null
    const char* last_byte = src + cleanup_bytes;  // points at the terminating '\0'

    // FIXME: copy the terminating 0 when we end at an aligned vector boundary
    // optionally special-case cleanup_bytes == 15: final aligned vector can be used.
    if (cleanup_bytes > 0) {
        if (last_byte - src_begin >= 16) {
            // if src==dest, this load overlaps with the last store:  store-forwarding stall.  Hopefully OOO execution hides it
            __m128i sv = _mm_loadu_si128( (const __m128i*)(last_byte-15) ); // includes the \0
            _mm_storeu_si128((__m128i*)(dst + cleanup_bytes - 15), upcase_si128(sv));
        } else {
            // whole string less than 16B
            // if this is common, try 64b or even 32b cleanup with movq / movd and upcase_si128
#if 1
            for (unsigned int i = 0 ; i <= cleanup_bytes ; ++i) {
                dst[i] = ascii_toupper(src[i]);
            }
#else
            // gcc stupidly auto-vectorizes this, resulting in huge code bloat, but no measurable slowdown because it never runs
            for (int i = cleanup_bytes - 1 ;  i >= 0 ; --i) {
                dst[i] = ascii_toupper(src[i]);
            }
#endif
        }
    }

    return last_byte - src_begin;
}

Times for 40M iterations, on a Core2 (Merom) 2.4GHz. gcc 5.2 -O3 -march=native. (Ubuntu 15.10). dst != src (so we make a copy), but they don't overlap (and aren't nearby). Both are aligned.

  • 15 char string: baseline: 1.08s. autovec: 1.34s. manual: 1.29s
  • 16 char string: baseline: 1.16s. autovec: 1.52s. manual: 0.335s
  • 31 char string: manual: 0.479s
  • 127 char string: baseline: 8.91s. autovec: 2.98s. manual: 0.925s
  • 128 char string: baseline: 9.00s. autovec: 2.06s. manual: 0.931s
  • 129 char string: baseline: 9.04s. autovec: 2.07s. manual: 1.02s

(Actually timed with _mm_store in the loop, not _mm_storeu, because storeu is slower on Merom even when the address is aligned. It's fine on Nehalem and later. I've also left the code as-is for now, instead of fixing the failure to copy the terminating 0 in some cases, because I don't want to re-time everything.)

So for short strings longer than 16B, this is dramatically faster than auto-vectorized. Lengths one-less-than-a-vector-width don't present a problem. They might be a problem when operating in-place, because of a store-forwarding stall. (But note that it's still fine to process our own output, rather than the original input, because toupper is idempotent).

There's a lot of scope for tuning this for different use-cases, depending on what the surrounding code wants, and the target microarchitecture. Getting the compiler to emit nice code for the cleanup portion is tricky. Using ffs(3) (which compiles to bsf or tzcnt on x86) seems to be good, but obviously that bit needs a re-think since I noticed a bug after writing up most of this answer (see the FIXME comments).

Vector speedups for even smaller strings can be obtained with movq or movd loads/stores. Customize as necessary for your use-case.


UTF-8:

We can detect when our vector has any bytes with the high bit set, and in that case fall back to a scalar utf-8-aware loop for that vector. The dst point can advance by a different amount than the src pointer, but once we get back to an aligned src pointer, we'll still just do unaligned vector stores to dst.

For text that's UTF-8, but mostly consists of the ASCII subset of UTF-8, this can be good: high performance in the common case with correct behaviour in all cases. When there's a lot of non-ASCII, it will probably be worse than staying in the scalar UTF-8 aware loop all the time, though.

Making English faster at the expense of other languages is not a future-proof decision if the downside is significant.


Locale-aware:

In the Turkish locale (tr_TR), the correct result from toupper('i') is 'İ' (U0130), not 'I' (plain ASCII). See Martin Bonner's comments on a question about tolower() being slow on Windows.

We can also check for an exception-list and fallback to scalar there, like for multi-byte UTF8 input characters.

With this much complexity, SSE4.2 PCMPISTRM or something might be able to do a lot of our checks in one go.

Tags:

C++

String