Is there a more efficient way of expanding a char to an uint64_t?
If you're looking for efficiency use a lookup table: a static array of 256 entries, each already holding the required result. You can use your code above to generate it.
In selected architectures (SSE,Neon) there are fast vector operations that can speed up this task or are designed to do this. Without special instructions the suggested look up table approach is both the fastest and most portable.
If the 2k size is an issue, parallel vector arithmetic operations can be simulated:
static uint64_t inflate_parallel(unsigned char a) {
uint64_t vector = a * 0x0101010101010101ULL;
// replicate the word all over qword
// A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
vector &= 0x8040201008040201; // becomes 80 00 20 00 00 04 00 01 <--
vector += 0x00406070787c7e7f; // becomes 80 40 80 70 78 80 7e 80
// MSB is correct
vector = (vector >> 7) & 0x0101010101010101ULL; // LSB is correct
return vector * 255; // all bits correct
}
EDIT: 2^31 iterations, (four time unroll to mitigate loop evaluation)
time ./parallel time ./original time ./lookup
real 0m2.038s real 0m14.161s real 0m1.436s
user 0m2.030s user 0m14.120s user 0m1.430s
sys 0m0.000s sys 0m0.000s sys 0m0.000s
That's about 7x speedup, while the lookup table gives ~10x speedup
You should profile what your code does, before worrying about optimising it.
On my compiler locally, your code gets entirely inlined, unrolled and turned into 8 constant test + or instructions when the value is unknown, and turned into a constant when the value is known at compile time. I could probably marginally improve it by removing a few branches, but the compiler is doing a reasonable job on its own.
Optimising the loop is then a bit pointless. A table lookup might be more efficient, but would probably prevent the compiler from making optimisations itself.