How does MIPS I handle branching on the previous ALU instruction without stalling?
TL;DR: Classic MIPS I checks branch conditions in the first half cycle of EX, so forwarding to them is not special.
IF only needs the address in the 2nd half of a cycle, so EX can forward to it.
These factors combine to give only 1 cycle of branch latency (hidden by 1 delay slot), with no problem for branches that depend on the previous ALU instruction.
It was definitely safe to run sltu / beq on MIPS I (R2000). That's listed as the expansion for the bgeu pseudo-instruction, for example, in real MIPS manuals and books, with no caveat about it being unsafe on MIPS R2000 or any other MIPS.
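To make the sltu / beq expansion concrete, here is a minimal Python sketch of what the two-instruction sequence computes. The function names are made up for illustration, and this models only the register-level semantics (the temporary stands in for $at), not the pipeline timing:

```python
MASK32 = 0xFFFFFFFF  # MIPS I registers are 32 bits wide


def sltu(rs, rt):
    """Set-on-less-than-unsigned: 1 if rs < rt as unsigned 32-bit values, else 0."""
    return 1 if (rs & MASK32) < (rt & MASK32) else 0


def beq_taken(rs, rt):
    """beq is taken when the two registers are equal."""
    return (rs & MASK32) == (rt & MASK32)


def bgeu_taken(rs, rt):
    """Model of the expansion: sltu $at, rs, rt ; beq $at, $zero, target."""
    at = sltu(rs, rt)        # ALU instruction producing the branch input
    return beq_taken(at, 0)  # branch reads the result of the previous instruction
```

The beq here consumes the value produced by the instruction immediately before it, which is exactly the back-to-back dependency the question is about.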
GCC uses sequences like that in practice even with -march=mips1, which respects load-delay slots and other features of real MIPS R2000.
MIPS's IF doesn't need an address until the 2nd half of a clock cycle, allowing EX to produce it quickly enough.
From See MIPS Run by Dominic Sweetman (covering MIPS I through MIPS IV), Chapter 1.5.1, Constraints on Instructions:
We’ll see later that efficient conditional branching means that the decision about whether to branch or not has to be squeezed into only half a pipeline stage; the architecture helps out by keeping the branch decision tests very simple. So conditional branches (in MIPS) test a single register for sign/zero or a pair of registers for equality.
The book's Figure 1.3 (The pipeline and branch delays) shows the branch condition being calculated in the first half of EX and used in the 2nd half of IF, for a total branch latency of only 1 cycle / pipeline stage (ID) / instruction. IF doesn't actually start until the 2nd half of a clock cycle, and it continues into ID; the actual decode/register-fetch of ID only takes the last fraction of a clock cycle.
That has the same end result as what I suggested in the question (check branch condition by the end of ID), except it only requires EX -> EX forwarding to branch on the result of the previous ALU instruction.
Perhaps I was misremembering or misinterpreting something I'd read previously about the half-cycle branch-decision. This half-cycle thing might well be exactly what I remembered seeing.
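The half-cycle arithmetic above can be checked with a small timing model. This is only a sketch under simplifying assumptions of mine, not something taken from the book: time is counted in half-cycles, the instruction in issue slot i occupies IF in cycle i, ID in cycle i+1 and EX in cycle i+2, IF needs its address at the midpoint of its cycle, and an ALU result is available for forwarding at the end of EX. All names are hypothetical.

```python
HALF = 1   # one half-cycle
CYCLE = 2  # one full clock cycle, in half-cycles


def if_address_needed(slot):
    """Assumed: IF needs the fetch address at the midpoint of cycle `slot`."""
    return slot * CYCLE + HALF


def ex_start(slot):
    """Assumed: EX is the third stage, so it starts two cycles after IF's cycle."""
    return (slot + 2) * CYCLE


def alu_result_ready(slot):
    """Assumed: an ALU result can be forwarded at the end of EX."""
    return ex_start(slot) + CYCLE


def delay_slots(branch_slot, decision_offset=HALF):
    """Instructions fetched after the branch before the target can be fetched.

    decision_offset is how far into EX the branch decision and target are known:
    HALF models the fast path, CYCLE models "nothing special done".
    """
    decision_ready = ex_start(branch_slot) + decision_offset
    slot = branch_slot + 1
    while if_address_needed(slot) < decision_ready:
        slot += 1
    return slot - branch_slot - 1


def forwarding_slack(producer_slot, branch_slot):
    """Half-cycles between the producer's result and the branch's EX start.

    A non-negative value means EX -> EX forwarding gets the operand there in time.
    """
    return ex_start(branch_slot) - alu_result_ready(producer_slot)
```

Under these assumptions, `delay_slots(1)` gives 1 delay slot with the half-cycle decision, `delay_slots(1, decision_offset=CYCLE)` gives 2 if the decision only emerged at the end of EX, and `forwarding_slack(0, 1)` is 0, meaning a branch right after an ALU instruction gets its operand by ordinary EX -> EX forwarding without a stall.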
Further quoting See MIPS Run, 1.5.5, Programmer-Visible Pipeline Effects:
• Delayed branches: [first paragraph explains the branch-delay slot]
If nothing special was done by the hardware, the decision to branch or not, together with the branch target address, would emerge at the end of the ALU pipestage — in time to fetch the branch target instruction instead of the next instruction but two. But branches are important enough to justify special treatment, and you can see from Figure 1.3 [described above] that a special path is provided through the ALU to make the branch address available half a clock cycle early. Together with the odd half-clock-cycle shift of the instruction fetch stage, that means that the branch target can be fetched in time to become the next but one, so the hardware runs the branch instruction, then the branch delay slot instruction, and then the branch target — with no other delays.
... [don't waste your branch-delay slots]
... [many MIPS assemblers will reorder instructions for you if it's safe, to hide branch delay]
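The programmer-visible effect described in that quote (branch, then delay-slot instruction, then target) can be sketched as a toy execution trace. This is an illustrative model with made-up names, assuming every branch in the program is taken and there is exactly one delay slot; it is not a MIPS simulator:

```python
def run(program, max_steps=20):
    """Return the order in which instructions execute.

    program is a list of (name, target_index_or_None); an entry with a target
    is treated as an always-taken branch with one delay slot.
    """
    trace = []
    pc = 0
    pending = None  # branch target that takes effect after the delay slot
    while pc < len(program) and len(trace) < max_steps:
        name, target = program[pc]
        trace.append(name)
        next_pc = pc + 1
        if pending is not None:   # this instruction was the delay slot
            next_pc = pending
            pending = None
        if target is not None:    # branch: redirect after the next instruction
            pending = target
        pc = next_pc
    return trace


program = [
    ("sltu", None),
    ("beq", 4),            # taken branch to index 4
    ("delay_slot", None),  # still executes
    ("skipped", None),     # never executes
    ("target", None),
]
print(run(program))
```

The instruction after the branch always runs, and the one after that is replaced by the branch target, which matches the "next but one" wording in the quote.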
See MIPS Run has a foreword by John L. Hennessy, founder of MIPS Technologies, etc. That's not proof he signed off on everything in the book being accurate, but it's good evidence that the book's description of how MIPS managed this trick is accurate.
It's easily understandable and 100% plausible; we already know the data cache has single-cycle fetch latency (after address-generation in the EX stage).