Why is a simple loop optimized when the limit is 959 but not 960?

After reading Sulthan's comment, I guess that:

  1. The compiler fully unrolls the loop if the loop counter is constant (and not too high)

  2. Once it's unrolled, the compiler sees that the sum operations can be grouped into one.

If the loop isn't unrolled for some reason (here: it would generate too many statements with 1000), the operations cannot be grouped.

The compiler could see that the unroll of 1000 statements amounts to a single addition, but step 1 & 2 described above are two separate optimizations, so it cannot take the "risk" of unrolling, not knowing if the operations can be grouped (example: a function call cannot be grouped).

Note: This is a corner case: Who uses a loop to add the same thing over again? In that case, don't rely on the compiler possible unroll/optimise; directly write the proper operation in one instruction.


Very good question!

You seem to have hit a limit on the number of iterations or operations the compiler tries to inline when simplifying the code. As documented by Grzegorz Szpetkowski, there are compiler specific ways to tweak these limits with pragmas or command line options.

You can also play with Godbolt's Compiler Explorer to compare how different compilers and options impact the code generated: gcc 6.2 and icc 17 still inline the code for 960, whereas clang 3.9 does not (with the default Godbolt configuration, it actually stops inlining at 73).


TL;DR

By default, the current snapshot GCC 7 behaves inconsistently, while previous versions have default limit due to PARAM_MAX_COMPLETELY_PEEL_TIMES, which is 16. It can be overridden from command-line.

The rationale of the limit is to prevent too aggressive loop unrolling, that can be a double-edged sword.

GCC version <= 6.3.0

The relevant optimization option for GCC is -fpeel-loops, which is enabled indirectly along with flag -Ofast (emphasis is mine):

Peels loops for which there is enough information that they do not roll much (from profile feedback or static analysis). It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations).

Enabled with -O3 and/or -fprofile-use.

More details can be obtained by adding -fdump-tree-cunroll:

$ head test.c.151t.cunroll 

;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)

Not peeling: upper bound is known so can unroll completely

The message is from /gcc/tree-ssa-loop-ivcanon.c:

if (maxiter >= 0 && maxiter <= npeel)
    {
      if (dump_file)
        fprintf (dump_file, "Not peeling: upper bound is known so can "
         "unroll completely\n");
      return false;
    }

hence try_peel_loop function returns false.

More verbose output can be reached with -fdump-tree-cunroll-details:

Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely

It is possible to tweak the limits by plaing with max-completely-peeled-insns=n and max-completely-peel-times=n params:

max-completely-peeled-insns

The maximum number of insns of a completely peeled loop.

max-completely-peel-times

The maximum number of iterations of a loop to be suitable for complete peeling.

To learn more about insns, you can refer to GCC Internals Manual.

For instance, if you compile with following options:

-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000

then code turns into:

f:
        vmovss  xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1148207104

Clang

I am not sure what Clang actually does and how to tweak its limits, but as I observed, you could force it to evaluate the final value by marking the loop with unroll pragma, and it will remove it completely:

#pragma unroll
for (int i = 0; i < 960; i++)
    p++;

results into:

.LCPI0_0:
        .long   1148207104              # float 961
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret