Binary matrix multiplication bit twiddling hack
I'm not sure about most efficient, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations.
// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255; // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
If you are looking for a way to do dense matrix multiplication in parallel, partition your result matrix into blocks and compute each block in parallel.
http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
It is not clear what data structure you are using, which language (yes, I know you said 'any language'), and what you are trying to optimize (speed? memory?) etc. All of these may have profound impact on your solution.
Some examples:
- Say this was C/C++, and your matrices are continues bits in memory. Each row/column maps to a UINT8. In this case, multiplying a row with a column reduces to doing an 8-bit bitwise-&, and checking if the result is greater than 0 (no need to sum the bits). This takes 2 processor instruction.
- If you are forced to do bit-by-bit operations, use the bitwise 'or' (
|
) instead of+
. Some languages may lazy evaluate this, stopping at the first '1' they encounter. - If you can multi-thread, you could speedup calculations.
BTW, I'm assuming you have a lot of matrices to process, otherwise I would use a direct, and readable code. My guess is that even with a lot of matrices, the gain in performance would be negligible.