AND faster than integer modulo operation?

You may be interested in Embedded Live: Embedded Programmers' Guide to ARM’s Cortex-M Architecture.

The ARM Cortex-M family has unsigned and signed division instructions, UDIV and SDIV, which take 2 to 12 cycles. There is no MOD instruction, but an equivalent result is obtained by a {S,U}DIV followed by the multiply-and-subtract instruction MLS, which takes 2 cycles, for a total of 4 to 14 cycles.

The AND instruction is single-cycle, and therefore 4 to 14 times faster.
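As a rough sketch in C (illustrative function names; assuming an unsigned value and a modulus m that is a power of two), the two instruction sequences compute the same remainder:

#include <stdint.h>

/* What UDIV + MLS computes: quotient, then multiply-and-subtract. */
uint32_t mod_div(uint32_t n, uint32_t m) {
  return n - (n / m) * m;
}

/* What a single AND computes - valid only when m is a power of two. */
uint32_t mod_and(uint32_t n, uint32_t m) {
  return n & (m - 1u);
}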


It's more complicated than "single instruction" these days. Modern CPUs are complex beasts, and instruction cost has to be broken down into issue, execute, and latency figures. It also usually depends on the width of the divide/modulo - how many bits are involved.

In any case, I'm not aware of 32-bit division having single-cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex-A8 and A9.

In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power of two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - it depends on how smart your optimizer is):

unsigned page_size_bits = 12;     // optimization works even without const here

unsigned foo(unsigned address) {
  unsigned page_size = 1U << page_size_bits;   // compiler sees a power of two
  return address / page_size;                  // gcc can emit address >> page_size_bits
}

The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.

As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has a hardware divide, the shift/mask will usually still be faster.
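For example, a sketch of the manual shift/mask version (illustrative names, assuming the divisor is 1u << bits):

#include <stdint.h>

/* Quotient and remainder by a power of two, without any divide. */
uint32_t page_index(uint32_t address, unsigned bits) {
  return address >> bits;                /* address / (1u << bits) */
}

uint32_t page_offset(uint32_t address, unsigned bits) {
  return address & ((1u << bits) - 1u);  /* address % (1u << bits) */
}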

(Also, signed division has to truncate toward 0, and the quotient/remainder have to be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for a signed int x.)
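A small sketch of why the two differ for negative values (hypothetical example value):

#include <stdio.h>

int main(void) {
  int x = -7;
  printf("%d %% 4 = %d\n", x, x % 4);  /* -3: truncates toward zero */
  printf("%d & 3  = %d\n", x, x & 3);  /*  1: plain bit mask */
  return 0;
}

This is why the compiler has to emit extra fix-up code around the mask when the operand is signed.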


ARM is very generic. There are lots of different ARMs, and some of them do NOT have a division instruction (as Ray Toal already mentioned, modulo is usually implemented as an additional result of the division). So if you don't want to call a very slow division subroutine, the logical operation is much faster. And as cyco130 mentioned, any good compiler will recognize this on its own and generate the logical operation itself, so for clarity of the program code I would stay with the division (unless you are programming in assembler, in which case you have to do it yourself, and then you should use the logical operation).


If m is known at compile time (or even if it isn't), integer division and modulo can be re-expressed as multiplication by a magic "multiplicative inverse." The quotient comes out of the high 32 bits of the 64-bit product (after a small correcting shift), and the remainder (modulus) can then be recovered with a multiply-and-subtract:

http://www.hackersdelight.org/magic.htm

The following link claims that it is a standard compiler strength reduction:

http://www.flounder.com/multiplicative_inverse.htm
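A hedged sketch of the technique for the fixed divisor 5 (the constant 0xCCCCCCCD is ceil(2^34 / 5), as derived by the Hacker's Delight method for unsigned 32-bit operands):

#include <stdint.h>

/* Unsigned 32-bit division by 5 via a magic multiplier: take the
   64-bit product and shift out the scaling factor. */
uint32_t div5(uint32_t n) {
  return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
}

/* The remainder is then recovered with a multiply-and-subtract. */
uint32_t mod5(uint32_t n) {
  return n - div5(n) * 5u;
}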