Why does this delay-loop start to run faster after several iterations with no sleep?
After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.
If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
If you tested on a Skylake with kernel support for the new power-management mode (where the hardware takes full control of the clock speed), ramp-up would happen much faster.
If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)
Introducing a usleep
prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
comments on other theories:
re: David's theory that a potential context switch from usleep
could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.
Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int
of stack memory. Any potential cache pollution during usleep
is a tiny fraction of the time for this code (real code will be different)!
In more detail for x86:
The call to clock()
itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock()
will almost never be delayed, because it should still be hot in cache.
The run
function may be in a different cache line from main
(since gcc marks main
as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main
will have triggered the potential TLB miss before entering the timed region of the program.
gcc -O0 will compile the OP's code to something like this (Godbolt Compiler explorer): keeping the loop counter in memory on the stack.
The empty loop keeps the loop counter in stack memory, so on a typical Intel x86 CPU the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add
with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration
is 600k cycles, which dominates the contribution of at most a couple cache misses (~200 cycles each for code-fetch misses which prevent further instructions from issuing until they're resolved).
Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call
instruction).
Even if the loop-counter was kept in a register, 100k cycles is a lot.
A call to usleep
may or may not result in a context switch. If it does, it will take longer than if it doesn't.