Does a branch misprediction flush the entire pipeline, even for a very short if-statement body?
Most general purpose processors do flush the pipeline on a branch misprediction. The negative performance impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path selected later) and dynamic predication (where instructions in the branch shadow are predicated) in addition to extensive research on branch prediction (as well as other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al.'s "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)
IBM's POWER7 seems to be the first mainstream processor to implement anything more sophisticated than prefetching an alternate path (i.e., eager fetch), and it only handles the single instruction case. (POWER7 uses a branch prediction confidence estimate to choose whether to predicate or use prediction.)
Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch prediction confidence, speculation depth, and resource availability (information available to the front-end), it can easily be more effective to speculate deeper down a single path. Discovering the joining points of multiple paths and avoiding excessive redundant computation can also add complexity. (Ideally, control independent operations would only be executed once and joining and data flow would be optimized, but such optimization adds complexity.)
For a deeply pipelined in-order processor, it may seem attractive to predict short forward branches as not taken and only flush backward in the pipeline to the instruction targeted by the taken branch when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches use prediction), adding a single bit to each instruction could control whether it is converted to a nop or executed. (If only the case of a single instruction being branched over is handled, allowing multiple branches in the pipeline might not be especially complex.)
This would be similar to annul-if-taken branch delay slots. MIPS has "Branch Likely" instructions that annul the delay-slot instruction if the branch is not taken; these are marked as obsolete in Revision 2.62. While some of the justification for this is presumably to separate implementation from interface and the desire to recover instruction encoding space, the decision also hints that the concept has some issues.
If this were done for all short forward branches, it would throw away instructions when the branch was correctly predicted as taken. (Note that this penalty could be less if taken branches always experience a delay in fetch redirection, which would be more likely with a multi-cycle instruction cache access in a deeply pipelined processor. In that case, fetching as if there was no branch could have the same performance as a correctly predicted taken branch. However, one could argue that the processor should special-case such short taken branches to minimize such fetch bubbles.)
As an example, consider a scalar pipeline (non-branch instructions per cycle equal to 1.0) with branch resolution at the end of the eighth stage and no fetch redirection penalty on correctly predicted taken branches, handling single-instruction branch-overs. Assume 75% branch predictor accuracy (unbiased by direction) for such short forward branches (2% of instructions, taken 30% of the time) and 93% accuracy for other branches (18% of instructions). Eight cycles would be saved for short branches that would be mispredicted as taken (17.5% of such branches; 0.35% of instructions), seven cycles when mispredicted as not taken (7.5%; 0.15%), and one cycle would be lost when correctly predicted as taken (22.5%; 0.45%). In total 0.034 cycles per instruction would be saved. Without this optimization the cycles per instruction would be 1.2758.
(While the above numbers are just for example, they are probably not far from reality except for the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save power in short loops) because instruction cache access would probably be three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the overhead for predicted "strongly taken" short branches might be worthwhile.)
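The arithmetic in the example can be checked with a few lines of C. This is only a sketch of the model as stated: the struct and function names are made up for illustration, and the inputs are the example's assumptions (2% short forward branches, 30% taken, 75% accuracy in both directions, 8/7/1 cycles saved or lost). With those rates the taken-but-mispredicted case is 30% x 25% = 7.5% of such branches, and the net saving comes out at about 0.034 cycles per instruction.

```c
#include <assert.h>

/* Inputs of the worked example; all names here are illustrative only. */
struct short_branch_model {
    double frac_of_insns;           /* short forward branches / all instructions (0.02) */
    double taken_rate;              /* fraction of those branches that are taken (0.30) */
    double accuracy;                /* predictor accuracy, same for both directions (0.75) */
    double save_mispred_taken;      /* cycles saved: predicted taken, actually not taken (8) */
    double save_mispred_not_taken;  /* cycles saved: predicted not taken, actually taken (7) */
    double loss_correct_taken;      /* cycles lost: correctly predicted taken (1) */
};

/* Net cycles saved per instruction when short forward branches are
   handled by annulling the skipped instruction instead of predicting. */
double cycles_saved_per_insn(const struct short_branch_model *m)
{
    assert(m->accuracy >= 0.0 && m->accuracy <= 1.0);
    assert(m->taken_rate >= 0.0 && m->taken_rate <= 1.0);

    double miss = 1.0 - m->accuracy;
    double mispred_taken     = (1.0 - m->taken_rate) * miss;  /* 0.70 * 0.25 */
    double mispred_not_taken = m->taken_rate * miss;          /* 0.30 * 0.25 */
    double correct_taken     = m->taken_rate * m->accuracy;   /* 0.30 * 0.75 */

    double per_branch = mispred_taken     * m->save_mispred_taken
                      + mispred_not_taken * m->save_mispred_not_taken
                      - correct_taken     * m->loss_correct_taken;

    return m->frac_of_insns * per_branch;
}
```

Calling it with {0.02, 0.30, 0.75, 8.0, 7.0, 1.0} reproduces the example; changing the accuracy or the taken rate shows how quickly the benefit shrinks for well-predicted branches.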
In-order designs tend to use narrower and shallower pipelines and prefer simplicity (for lower design, power, and area costs). Since the instruction set is likely to support branchless code for many short-branch cases, the incentive to optimize this aspect is further decreased.
For out-of-order implementations, the potentially branched-over instructions would have to be predicated since the processor would want to be able to execute later non-dependent instructions. Predication introduces an additional data dependency which must be checked for scheduling. It is common for instruction schedulers to provide only two comparators per instruction and to split a conditional move (a simple instruction with only three data flow operands: the old value, the alternative value, and the condition) into two operations; a predicated register-register add would have four operands. (There are alternative ways of addressing this issue, but this answer is already long.)
An out-of-order implementation would also not stall when a branch condition is not available. This is a tradeoff between a control dependency and a data dependency. With accurate branch prediction a control dependency is extremely inexpensive, but a data dependency can hold up forward progress waiting on data operands. (Of course, with a boolean data dependency, value prediction becomes somewhat more attractive. Using predicate prediction might be desirable in some cases and would have the advantage over simple predication of using dynamic cost and confidence estimates.)
(It is perhaps telling that ARM chose to drop extensive predication in 64-bit AArch64. While a large part of this is for instruction encoding, the benefit of predication for high-performance implementations is presumably relatively low.)
Compiler issues
The performance of branchless versus branching code depends on the predictability of the branch and other factors (including, if taken, any penalty for redirecting fetch), but it is difficult for the compiler to determine the predictability of a branch. Even profile data typically only provides branch frequencies which can give a pessimistic view of predictability since such does not account for the branch predictor using local or global history. A compiler is also not perfectly aware of timing of data availability and other dynamic aspects. If the condition is available later than the operands used for computation, then replacing a control dependence (branch prediction) with a data dependence (predication) could degrade performance. Branchless code may also introduce more live values, potentially adding register spill and fill overhead.
Complicating this further, most instruction sets that only provide conditional move or select instructions do not provide a conditional store. While this can be worked around by using conditional move to select a safe, ignored storage location, such seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an addition and conditional move might take three cycles where a correctly predicted branch and addition would take zero (if addition is branched over) or one cycle.
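The conditional-store workaround mentioned above can be sketched in C as follows. The helper name store_if and the variable ignored_slot are hypothetical, and whether the address selection becomes a conditional move is up to the compiler and target; the C source only makes the store itself unconditional.

```c
#include <assert.h>
#include <stddef.h>

/* Safe location whose contents are never read (hypothetical name). */
static int ignored_slot;

/* Store value to *dst only when cond is nonzero, without branching
   around the store: select the address, then store unconditionally. */
void store_if(int cond, int *dst, int value)
{
    assert(dst != NULL);
    /* A compiler may lower this select to a conditional move,
       but that is not guaranteed. */
    int *target = cond ? dst : &ignored_slot;
    *target = value;
}
```

Note the extra cost this adds: an address select plus a store that always executes, which is part of why the workaround looks unattractive next to a well-predicted branch.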
A further complication is that predicated operations are generally ignored by the branch predictor. If a later retained branch correlates with the condition of the removed branch, the branch misprediction rate may increase for that later branch. (Predicate prediction could be used to retain the predictor effects of such removed branches.)
With the increased emphasis on vectorization, the use of branchless code becomes even more significant since branch-based code constrains the ability to use operations on an entire vector.
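As a rough illustration of the vectorization point, here is a loop whose body is written as a select rather than as a branch around a store. The function name is made up for this sketch, and whether a given compiler actually vectorizes it depends on the compiler, flags, and target.

```c
#include <assert.h>
#include <stddef.h>

/* Replace negative entries with zero. Every element does the same work
   (load, compare, select, store), which maps naturally onto a vector
   compare followed by a blend or masked operation. */
void clamp_negative_to_zero(int *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int x = v[i];
        v[i] = (x < 0) ? 0 : x;   /* select instead of "if (x < 0) v[i] = 0;" */
    }
}
```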
Modern high-performance out-of-order CPUs usually do not flush the entire pipeline[0] on a misprediction, but it doesn't really depend on the distance of the branch or work as you suggest.
They generally use something similar to the strategy of flushing the branch instruction and all younger instructions. The front-end is flushed, since it will be full of instructions on the mispredicted path, but beyond the front-end modern cores may have more than 100 instructions in flight at once, only some of which may be younger than the branch.
This means that the cost of the branch is at least partly related to the surrounding instructions: if the branch condition can be checked early, the impact of a misprediction can be limited or even zero[1]. On the other hand, if the branch condition is handled late, after considerable resources have been spent on the wrong path, the cost can be large (e.g., larger than the 12-20 cycle "published" branch misprediction penalty you'll often see).
[0] The exact terminology is up for debate here: the meaning of flushing the pipeline isn't entirely clear for out-of-order processors. Here I mean that the CPU does not flush all in-flight-but-possibly-not-executed instructions.
[1] In particular, the limiting factor for some sequence of instructions could be a dependency chain whose current execution is far enough behind the leading edge of the instruction window that the misprediction doesn't flush any of those instructions and doesn't slow down the code at all.
"If it mispredicted, and the branch was actually taken, then the CPU really only has to discard 1 instruction from the pipeline (the one in the if-body)."
That's not as easy as you make it sound. Instructions modify various kinds of architectural state on which other instructions rely (register results, condition flags, memory, etc.). By the time you realize you've mispredicted, you could potentially have tons of instructions in the pipeline that have started execution based on state changed by that instruction and by all subsequent instructions in the pipeline... Not to mention instructions that can raise faults/exceptions.
A simple example:
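b = 0;
if (a == 0) {          /* the short, possibly mispredicted branch */
    b = 1;
}
c = b * 10;            /* depends on b */
if (b == 0)            /* a second branch that depends on b */
    printf("\nc = %d.", c);
foo(b);                /* and so does every later use of b */
/* etc. */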
To undo that "one simple instruction" would take a lot of work.
For simple branches with poor predictability, predication/cmovs/etc are preferred.
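As a small sketch of what that means for the example above (function names are made up for illustration): the assignment to b can be written so that it is computed as data rather than reached through a branch. A compiler may well produce the same machine code for both forms, so treat this as showing the idea rather than a guaranteed code-generation result.

```c
#include <assert.h>

/* Branching form: b is set through a control dependency on a. */
int b_with_branch(int a)
{
    int b = 0;
    if (a == 0) {
        b = 1;
    }
    return b;
}

/* Branchless form: b is a data dependency on the comparison result,
   which typically maps to a compare plus set/select instruction. */
int b_branchless(int a)
{
    return a == 0;
}

/* Quick self-check that the two forms agree for a given input. */
void check_forms_agree(int a)
{
    assert(b_with_branch(a) == b_branchless(a));
}
```

With the second form there is nothing to mispredict for this assignment, at the price of later instructions waiting on the comparison result.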