How do I write a maintainable, fast, compile-time bit-mask in C++?
if using only C++11 is a must (&a)[N]
is a way to capture arrays. This allows you to write one single recursive function without using helper functions whatsoever:
template <std::size_t N>
constexpr std::uint64_t generate_mask(Flags const (&a)[N], std::size_t i = 0u){
return i < N ? (1ull << a[i] | generate_mask(a, i + 1u)) : 0ull;
}
assigning it to a constexpr auto
:
void apply_known_mask(std::bitset<64>& bits) {
constexpr const Flags important_bits[] = { B, D, E, H, K, M, L, O };
constexpr auto m = generate_mask(important_bits); //< here
bits &= m;
}
Test
int main() {
std::bitset<64> b;
b.flip();
apply_known_mask(b);
std::cout << b.to_string() << '\n';
}
Output
0000000000000000000000000000000000101110010000000000000100100100
// ^ ^^^ ^ ^ ^ ^
// O MLK H E D B
one really has to appreciate C++`s ability to calculate anything turing computable at compile time. It surely still blows my mind (<>).
For the later versions C++14 and C++17 yakk's answer already wonderfully covers that.
Best version is c++17:
template< unsigned char... indexes >
constexpr unsigned long long mask(){
return ((1ull<<indexes)|...|0ull);
}
Then
void apply_known_mask(std::bitset<64> &bits) {
constexpr auto m = mask<B,D,E,H,K,M,L,O>();
bits &= m;
}
back in c++14, we can do this strange trick:
template< unsigned char... indexes >
constexpr unsigned long long mask(){
auto r = 0ull;
using discard_t = int[]; // data never used
// value never used:
discard_t discard = {0,(void(
r |= (1ull << indexes) // side effect, used
),0)...};
(void)discard; // block unused var warnings
return r;
}
or, if we are stuck with c++11, we can solve it recursively:
constexpr unsigned long long mask(){
return 0;
}
template<class...Tail>
constexpr unsigned long long mask(unsigned char b0, Tail...tail){
return (1ull<<b0) | mask(tail...);
}
template< unsigned char... indexes >
constexpr unsigned long long mask(){
return mask(indexes...);
}
Godbolt with all 3 -- you can switch CPP_VERSION define, and get identical assembly.
In practice I'd use the most modern I could. 14 beats 11 because we don't have recursion and hence O(n^2) symbol length (which can explode compile time and compiler memory usage); 17 beats 14 because the compiler doesn't have to dead-code-eliminate that array, and that array trick is just ugly.
Of these 14 is the most confusing. Here we create an anonymous array of all 0s, meanwhile as a side effect construct our result, then discard the array. The discarded array has a number of 0s in it equal to the size of our pack, plus 1 (which we add so we can handle empty packs).
A detailed explanation of what the c++14 version is doing. This is a trick/hack, and the fact you have to do this to expand parameters packs with efficiency in C++14 is one of the reasons why fold expressions were added in c++17.
It is best understood from the inside out:
r |= (1ull << indexes) // side effect, used
this just updates r
with 1<<indexes
for a fixed index. indexes
is a parameter pack, so we'll have to expand it.
The rest of the work is to provide a parameter pack to expand indexes
inside of.
One step out:
(void(
r |= (1ull << indexes) // side effect, used
),0)
here we cast our expression to void
, indicating we don't care about its return value (we just want the side effect of setting r
-- in C++, expressions like a |= b
also return the value they set a
to).
Then we use the comma operator ,
and 0
to discard the void
"value", and return the value 0
. So this is an expression whose value is 0
and as a side effect of calculating 0
it sets a bit in r
.
int discard[] = {0,(void(
r |= (1ull << indexes) // side effect, used
),0)...};
At this point, we expand the parameter pack indexes
. So we get:
{
0,
(expression that sets a bit and returns 0),
(expression that sets a bit and returns 0),
[...]
(expression that sets a bit and returns 0),
}
in the {}
. This use of ,
is not the comma operator, but rather the array element separator. This is sizeof...(indexes)+1
0
s, which also set bits in r
as a side effect. We then assign the {}
array construction instructions to an array discard
.
Next we cast discard
to void
-- most compilers will warn you if you create a variable and never read it. All compilers will not complain if you cast it to void
, it is sort of a way to say "Yes, I know, I'm not using this", so it suppresses the warning.
I would encourage you to write a proper EnumSet
type.
Writing a basic EnumSet<E>
in C++14 (onwards) based on std::uint64_t
is trivial:
template <typename E>
class EnumSet {
public:
constexpr EnumSet() = default;
constexpr EnumSet(std::initializer_list<E> values) {
for (auto e : values) {
set(e);
}
}
constexpr bool has(E e) const { return mData & mask(e); }
constexpr EnumSet& set(E e) { mData |= mask(e); return *this; }
constexpr EnumSet& unset(E e) { mData &= ~mask(e); return *this; }
constexpr EnumSet& operator&=(const EnumSet& other) {
mData &= other.mData;
return *this;
}
constexpr EnumSet& operator|=(const EnumSet& other) {
mData |= other.mData;
return *this;
}
private:
static constexpr std::uint64_t mask(E e) {
return std::uint64_t(1) << e;
}
std::uint64_t mData = 0;
};
This allows you to write simple code:
void apply_known_mask(EnumSet<Flags>& flags) {
static constexpr EnumSet<Flags> IMPORTANT{ B, D, E, H, K, M, L, O };
flags &= IMPORTANT;
}
In C++11, it requires some convolutions, but remains possible nonetheless:
template <typename E>
class EnumSet {
public:
template <E... Values>
static constexpr EnumSet make() {
return EnumSet(make_impl(Values...));
}
constexpr EnumSet() = default;
constexpr bool has(E e) const { return mData & mask(e); }
void set(E e) { mData |= mask(e); }
void unset(E e) { mData &= ~mask(e); }
EnumSet& operator&=(const EnumSet& other) {
mData &= other.mData;
return *this;
}
EnumSet& operator|=(const EnumSet& other) {
mData |= other.mData;
return *this;
}
private:
static constexpr std::uint64_t mask(E e) {
return std::uint64_t(1) << e;
}
static constexpr std::uint64_t make_impl() { return 0; }
template <typename... Tail>
static constexpr std::uint64_t make_impl(E head, Tail... tail) {
return mask(head) | make_impl(tail...);
}
explicit constexpr EnumSet(std::uint64_t data): mData(data) {}
std::uint64_t mData = 0;
};
And is invoked with:
void apply_known_mask(EnumSet<Flags>& flags) {
static constexpr EnumSet<Flags> IMPORTANT =
EnumSet<Flags>::make<B, D, E, H, K, M, L, O>();
flags &= IMPORTANT;
}
Even GCC trivially generates an and
instruction at -O1
godbolt:
apply_known_mask(EnumSet<Flags>&):
and QWORD PTR [rdi], 775946532
ret
The optimization you're looking for seems to be loop peeling, which is enabled at -O3
, or manually with -fpeel-loops
. I'm not sure why this falls under the purview of loop peeling rather than loop unrolling, but possibly it's unwilling to unroll a loop with nonlocal control flow inside it (as there is, potentially, from the range check).
By default, though, GCC stops short of being able to peel all the iterations, which apparently is necessary. Experimentally, passing -O2 -fpeel-loops --param max-peeled-insns=200
(the default value is 100) gets the job done with your original code: https://godbolt.org/z/NNWrga