C++ performance challenge: integer to std::string conversion
#include <string>
const char digit_pairs[201] = {
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
};
std::string& itostr(int n, std::string& s)
{
if(n==0)
{
s="0";
return s;
}
int sign = -(n<0);
unsigned int val = (n^sign)-sign;
int size;
if(val>=10000)
{
if(val>=10000000)
{
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else
size=8;
}
else
{
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else
size=5;
}
}
else
{
if(val>=100)
{
if(val>=1000)
size=4;
else
size=3;
}
else
{
if(val>=10)
size=2;
else
size=1;
}
}
size -= sign;
s.resize(size);
char* c = &s[0];
if(sign)
*c='-';
c += size-1;
while(val>=100)
{
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
}
while(val>0)
{
*c--='0' + (val % 10);
val /= 10;
}
return s;
}
std::string& itostr(unsigned val, std::string& s)
{
if(val==0)
{
s="0";
return s;
}
int size;
if(val>=10000)
{
if(val>=10000000)
{
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else
size=8;
}
else
{
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else
size=5;
}
}
else
{
if(val>=100)
{
if(val>=1000)
size=4;
else
size=3;
}
else
{
if(val>=10)
size=2;
else
size=1;
}
}
s.resize(size);
char* c = &s[size-1];
while(val>=100)
{
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
}
while(val>0)
{
*c--='0' + (val % 10);
val /= 10;
}
return s;
}
This will blow up on systems that disallow unaligned memory accesses (in which case, the first unaligned assignment via *(short*)
would cause a segfault), but should work very nicely otherwise.
One important thing to do is to minimize the use of std::string
. (Ironic, I know.) In Visual Studio, for example, most calls to methods of std::string are not inlined, even if you specify /Ob2 in compiler options. So even something as trivial as a call to std::string::clear()
, which you might expect to be very fast, can take 100 clockticks when linking CRT as a static library, and as much as 300 clockticks when linking as a DLL.
For the same reason, returning by reference is better because it avoids an assignment, a constructor and a destructor.
Ah, awesome challenge by the way... I've had a lot of fun with this.
I have two algorithms to submit (code is at the bottom if you feel like skipping to it). In my comparisons I require that the function return a string and that it can handle int and unsigned int. Comparing things that don't construct a string to those that do doesn't really make sense.
The first one is a fun implementation that doesn't use any precomputed lookup tables or explicit division/modulo. This one is competitive with the others with gcc and with all but Timo's on msvc (for a good reason that I explain below). The second algorithm is my actual submission for highest performance. In my tests it beats all the others on both gcc and msvc.
I think I know why some of the results on MSVC are very good. std::string has two relevant constructors
std::string(char* str, size_t n)
and
std::string(ForwardIterator b, ForwardIterator e)
gcc does the same thing for both of them... that is it uses the second to implement the first. The first constructor can be implemented significantly more efficiently than that and MSVC does so. The side benefit of this is that in some cases (like my fast code and Timo's code) the string constructor can be inlined. In fact, just switching between these constructors in MSVC is almost a 2x difference for my code.
My performance testing results:
Code Sources:
- Voigt
- Timo
- ergosys
- user434507
- user-voigt-timo
- hopman-fun
- hopman-fast
gcc 4.4.5 -O2 on Ubuntu 10.10 64-bit, Core i5
hopman_fun: 124.688 MB/sec --- 8.020 s hopman_fast: 137.552 MB/sec --- 7.270 s voigt: 120.192 MB/sec --- 8.320 s user_voigt_timo: 97.9432 MB/sec --- 10.210 s timo: 120.482 MB/sec --- 8.300 s user: 97.7517 MB/sec --- 10.230 s ergosys: 101.42 MB/sec --- 9.860 s
MSVC 2010 64-bit /Ox on Windows 7 64-bit, Core i5
hopman_fun: 127 MB/sec --- 7.874 s hopman_fast: 259 MB/sec --- 3.861 s voigt: 221.435 MB/sec --- 4.516 s user_voigt_timo: 195.695 MB/sec --- 5.110 s timo: 253.165 MB/sec --- 3.950 s user: 212.63 MB/sec --- 4.703 s ergosys: 78.0518 MB/sec --- 12.812 s
Here are some results and a testing/timing framework on ideone
http://ideone.com/XZRqp
Note that ideone is a 32-bit environment. Both of my algorithms suffer from that, but hopman_fast is at least still competetive.
Note that for those the two or so that don't construct a string I added the following function template:
template <typename T>
std::string itostr(T t) {
std::string ret;
itostr(t, ret);
return ret;
}
Now for my code...first the fun one:
// hopman_fun
template <typename T>
T reduce2(T v) {
T k = ((v * 410) >> 12) & 0x000F000F000F000Full;
return (((v - k * 10) << 8) + k);
}
template <typename T>
T reduce4(T v) {
T k = ((v * 10486) >> 20) & 0xFF000000FFull;
return reduce2(((v - k * 100) << 16) + (k));
}
typedef unsigned long long ull;
inline ull reduce8(ull v) {
ull k = ((v * 3518437209u) >> 45);
return reduce4(((v - k * 10000) << 32) + (k));
}
template <typename T>
std::string itostr(T o) {
union {
char str[16];
unsigned short u2[8];
unsigned u4[4];
unsigned long long u8[2];
};
unsigned v = o < 0 ? ~o + 1 : o;
u8[0] = (ull(v) * 3518437209u) >> 45;
u8[0] = (u8[0] * 28147497672ull);
u8[1] = v - u2[3] * 100000000;
u8[1] = reduce8(u8[1]);
char* f;
if (u2[3]) {
u2[3] = reduce2(u2[3]);
f = str + 6;
} else {
unsigned short* k = u4[2] ? u2 + 4 : u2 + 6;
f = *k ? (char*)k : (char*)(k + 1);
}
if (!*f) f++;
u4[1] |= 0x30303030;
u4[2] |= 0x30303030;
u4[3] |= 0x30303030;
if (o < 0) *--f = '-';
return std::string(f, (str + 16) - f);
}
And then the fast one:
// hopman_fast
struct itostr_helper {
static unsigned out[10000];
itostr_helper() {
for (int i = 0; i < 10000; i++) {
unsigned v = i;
char * o = (char*)(out + i);
o[3] = v % 10 + '0';
o[2] = (v % 100) / 10 + '0';
o[1] = (v % 1000) / 100 + '0';
o[0] = (v % 10000) / 1000;
if (o[0]) o[0] |= 0x30;
else if (o[1] != '0') o[0] |= 0x20;
else if (o[2] != '0') o[0] |= 0x10;
else o[0] |= 0x00;
}
}
};
unsigned itostr_helper::out[10000];
itostr_helper hlp_init;
template <typename T>
std::string itostr(T o) {
typedef itostr_helper hlp;
unsigned blocks[3], *b = blocks + 2;
blocks[0] = o < 0 ? ~o + 1 : o;
blocks[2] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[2] = hlp::out[blocks[2]];
if (blocks[0]) {
blocks[1] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[1] = hlp::out[blocks[1]];
blocks[2] |= 0x30303030;
b--;
}
if (blocks[0]) {
blocks[0] = hlp::out[blocks[0] % 10000];
blocks[1] |= 0x30303030;
b--;
}
char* f = ((char*)b);
f += 3 - (*f >> 4);
char* str = (char*)blocks;
if (o < 0) *--f = '-';
return std::string(f, (str + 12) - f);
}
Benchmark data for the code provided in the question:
On ideone (gcc 4.3.4):
- stringstreams: 4.4 MB/s
- sprintf: 25.0 MB/s
- mine (Ben Voigt): 55.8 MB/s
- Timo: 58.5 MB/s
- user434507: 199 MB/s
- user434507's Ben-Timo-507 hybrid: 263 MB/s
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 32-bit:
cl /Ox /EHsc
- stringstreams: 3.39 MB/s, 3.67 MB/s
- sprintf: 16.8 MB/s, 16.2 MB/s
- mine: 194 MB/s, 207 MB/s (with PGO enabled: 250 MB/s)
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 64-bit:
cl /Ox /EHsc
- stringstreams: 4.42 MB/s, 4.92 MB/s
- sprintf: 21.0 MB/s, 20.8 MB/s
- mine: 238 MB/s, 228 MB/s
Core i7, Windows 7 64-bit, 8 GB RAM, cygwin gcc 4.3.4:
g++ -O3
- stringstreams: 2.19 MB/s, 2.17 MB/s
- sprintf: 13.1 MB/s, 13.4 MB/s
- mine: 30.0 MB/s, 30.2 MB/s
edit: I was gonna add my own answer, but the question was was closed so I'm adding it here. :) I wrote my own algorithm and managed to get a decent improvement over Ben's code, though I only tested it in MSVC 2010. I also made a benchmark of all the implementations presented so far, using the same testing setup that was in Ben's original code. -- Timo
Intel Q9450, Win XP 32bit, MSVC 2010
cl /O2 /EHsc
- stringstream: 2.87 MB/s
- sprintf: 16.1 MB/s
- Ben: 202 MB/s
- Ben (unsigned buffer): 82.0 MB/s
- ergosys (updated version): 64.2 MB/s
- user434507: 172 MB/s
- Timo: 241 MB/s
-
const char digit_pairs[201] = {
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
};
static const int BUFFER_SIZE = 11;
std::string itostr(int val)
{
char buf[BUFFER_SIZE];
char *it = &buf[BUFFER_SIZE-2];
if(val>=0) {
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[2*val],2);
if(val<10)
it++;
} else {
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[-2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[-2*val],2);
if(val<=-10)
it--;
*it = '-';
}
return std::string(it,&buf[BUFFER_SIZE]-it);
}
std::string itostr(unsigned int val)
{
char buf[BUFFER_SIZE];
char *it = (char*)&buf[BUFFER_SIZE-2];
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[2*val],2);
if(val<10)
it++;
return std::string((char*)it,(char*)&buf[BUFFER_SIZE]-(char*)it);
}