Fastest polling loop - how can I trim 1 CPU cycle?
If I understand the question correctly, the goal isn't necessarily to reduce the number of cycles per loop iteration, but the number of cycles between consecutive samples (i.e. LDR instructions). Nothing requires exactly one LDR per iteration, so you can try something like this:
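ldrb r1, [r0]      /* first sample, taken before entering the loop */
loop:
cbz r1, out        /* exit if the sample in r1 was zero */
ldrb r2, [r0]      /* second sample */
cbz r2, out        /* exit if the sample in r2 was zero */
ldrb r1, [r0]      /* next sample, checked at the top of the loop */
b loop
out: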
The gap between consecutive LDRB instructions alternates (one gap contains only a CBZ, the other a CBZ plus the branch back), so the samples aren't uniformly spaced.
This may also delay the exit from the loop slightly, but from the problem description I can't tell whether that matters.
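For readers who think in C, the structure above corresponds roughly to the sketch below. The function name `poll_until_zero` and the pointer `reg` are hypothetical, used only for illustration. A compiler is free to schedule the loads and branches differently, so this shows the shape of the unrolling only and says nothing about cycle counts; the assembly is what you should measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: spins until the byte at `reg` reads as zero.
   Two loads per iteration, each followed by its own exit test.
   Returns 0 if the first load of an iteration saw the zero, 1 if the second did. */
int poll_until_zero(const volatile uint8_t *reg)
{
    assert(reg != NULL);            /* sanity check, outside the polling loop */
    for (;;) {
        if (*reg == 0) { return 0; } /* first sample of this iteration */
        if (*reg == 0) { return 1; } /* second sample of this iteration */
    }
}
```

The return value plays the same role as having distinct exit labels: it tells you which of the two samples triggered the exit.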
I happen to have access to a cycle-accurate Cortex-M7 model. Once execution reaches a steady state, your original loop runs on the M7 in 3 cycles per iteration (meaning one LDR every 3 cycles), while the proposed loop above runs in 4 cycles per iteration but contains two LDRs (so one LDR every 2 cycles on average). The sampling rate is definitely improved.
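The comparison is just cycles per iteration divided by loads per iteration. A minimal sketch of that arithmetic, with a hypothetical helper name, and giving only the average spacing since the individual gaps in the unrolled loop are not equal:

```c
#include <assert.h>

/* Hypothetical helper: average number of cycles between consecutive loads,
   given the steady-state cycles per iteration and the loads per iteration. */
double avg_sample_interval(unsigned cycles_per_iter, unsigned loads_per_iter)
{
    assert(loads_per_iter > 0);
    return (double)cycles_per_iter / loads_per_iter;
}
```

With the figures quoted above, `avg_sample_interval(3, 1)` gives 3.0 for the original loop and `avg_sample_interval(4, 2)` gives 2.0 for the unrolled one.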
To give credit, unrolling with CBZ as a break was proposed by @Peter Cordes in a comment.
Admittedly, the M3 will be slower, but it's still worth a shot if the sampling rate is what you're after.
You can also check whether using LDRB instead of LDR (as in the code above) changes anything, although I don't expect it to.
UPD: Here is another 2-LDR version of the loop, which completes in 3 cycles per iteration on the M7; you can try it out of interest. Having separate CBZ exits also makes it easy to balance the paths after the loop:
ldr r1, [r0]
loop:
ldr r2, [r0]
cbz r1, out_slow
cbz r2, out_fast
ldr r1, [r0]
b loop
out_fast:
/* NOPs as required */
out_slow: