Why is a CPU branch instruction slow?

Oli gave a very good explanation of why branching is expensive: pipelining and branch prediction. I want to add, however, that you usually don't need to worry much about this, because modern compilers optimize the code, and one of those optimizations is reducing branching.

You can read more about C++ optimizations in the Microsoft compiler's documentation on Profile-Guided Optimization: the optimizer uses runtime information (i.e. which parts of the code are used most) to optimize your code. The speed-up is in the 20% range.

One of the optimizations it performs is "Conditional Branch Optimization". For example, assuming that i is 6 most of the time, this is faster:

if (i==6)
{
    //...
}
else
{
    switch (i)
    {
        case 1: //
        case 2: //
        //...
    }
}

than:

switch (i)
{
    case 1: //
    //...
    case 6: //
    case 7: //
}
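To make the comparison above concrete, here is a minimal, compilable sketch of both shapes. The function names and the returned strings are made up for illustration, and it assumes profiling has already shown that i == 6 dominates. Whether the first form is actually faster depends on the compiler and the data, so treat it as a picture of the transformation rather than a guaranteed win.

```cpp
#include <string>

// Hypothetical example: names and return values are placeholders.
// Assumes profiling showed that i == 6 is by far the most common input.
std::string dispatch_hot_first(int i) {
    if (i == 6) {                 // common case is tested first, on its own
        return "six";
    }
    switch (i) {                  // rare cases go through the general switch
        case 1:  return "one";
        case 2:  return "two";
        case 7:  return "seven";
        default: return "other";
    }
}

// Baseline: a single switch that produces the same results.
std::string dispatch_plain(int i) {
    switch (i) {
        case 1:  return "one";
        case 2:  return "two";
        case 6:  return "six";
        case 7:  return "seven";
        default: return "other";
    }
}
```

Both functions return the same value for every input; only the order in which the cases are tested differs.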

Here is a blog post on other optimizations: http://bogdangavril.wordpress.com/2011/11/02/optimizating-your-native-program/


A branch instruction is not inherently slower than any other instruction.

However, the reason you have heard that branches should be avoided is that modern CPUs follow a pipeline architecture. This means that multiple sequential instructions are being executed simultaneously, each at a different stage. But the pipeline can only be fully utilised if it is able to read the next instruction from memory on every cycle, which in turn means it needs to know which instruction to read.

On a conditional branch, it usually doesn't know ahead of time which path will be taken. So when this happens, the CPU either has to stall until the decision has been resolved, or guess and then throw away everything in the pipeline behind the branch instruction if the guess turns out to be wrong. Either way this lowers utilisation, and therefore performance.

This is the reason that things like branch prediction and branch delay slots exist.
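As a rough illustration of what "avoiding a branch" can look like in source code, here is a sketch with two versions of the same loop. The function names are invented for this example. The first version has a data-dependent `if`; the second replaces it with arithmetic on the comparison result. Note that an optimizing compiler may well turn the first into branch-free code on its own, so this shows the idea rather than a guaranteed speed-up.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: written as a conditional whose outcome depends on the data.
// With unpredictable data, a branch here would be hard to predict.
std::int64_t sum_at_least_branchy(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) {
            sum += v;
        }
    }
    return sum;
}

// Branchless version: the comparison evaluates to 0 or 1, which scales the
// value being added, so the loop body contains no `if`.
std::int64_t sum_at_least_branchless(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        sum += static_cast<std::int64_t>(v) * (v >= threshold);
    }
    return sum;
}
```

Both functions compute the same sum; which one is faster on a given machine is something to measure, not assume.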


Because a CPU uses a pipeline to execute instructions: while one instruction is at some stage of execution (for example, reading values from registers), the next instruction is being processed at the same time, but at an earlier stage (for example, the decode stage). This works well for non-control instructions, but it makes things complicated when control instructions like jmp or call are executed.

Since the CPU does not know what the next instruction will be when it executes a conditional jump, it uses branch prediction techniques to predict whether the branch will be taken or not (for example, a branch at the end of a loop body will probably take the instruction flow back to the loop head).

However, when such a prediction fails, which is called a branch misprediction, execution performance suffers, because the instructions in the pipeline after the branch have to be discarded and execution has to start over from the correct instruction.
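To show why the loop example above predicts so well, here is a toy simulation of a two-bit saturating counter, a classic textbook prediction scheme. This is only a sketch: real CPUs use far more elaborate predictors, and the type and function names below are invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Two-bit saturating counter: states 0 and 1 predict "not taken",
// states 2 and 3 predict "taken".
struct TwoBitPredictor {
    int state = 2;  // start at "weakly taken" (arbitrary choice for this sketch)

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken) {
            if (state < 3) ++state;
        } else {
            if (state > 0) --state;
        }
    }
};

// Counts how many outcomes in the sequence the predictor gets wrong.
int count_mispredictions(const std::vector<bool>& outcomes) {
    TwoBitPredictor predictor;
    int misses = 0;
    for (bool taken : outcomes) {
        if (predictor.predict() != taken) ++misses;
        predictor.update(taken);
    }
    return misses;
}

// Outcomes of the backward branch of a loop that runs `iterations` times:
// taken on every iteration except the last one, which leaves the loop.
std::vector<bool> loop_branch_outcomes(int iterations) {
    std::size_t n = iterations > 0 ? static_cast<std::size_t>(iterations) : 0;
    std::vector<bool> outcomes(n, true);
    if (!outcomes.empty()) outcomes.back() = false;
    return outcomes;
}
```

With this model, the branch of a 100-iteration loop is mispredicted only once (on loop exit), while a branch that strictly alternates between taken and not taken is mispredicted half of the time.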