How can a CPU deliver more than one instruction per cycle?
First, as Keelan's comment and Turbo J's answer point out, the measurement was 113,093 Dhrystone MIPS not native MIPS.
The Ivy Bridge microarchitecture of the i7 3630QM can only commit 4 fused µops per cycle, though it can begin execution of 6 µops per cycle. (The number of fused µops in a trace of code is roughly equal to the number of instructions; some complex instructions are decoded into multiple µops that are not fused and some pairs of instructions can be fused into a single µop, e.g., a compare immediately followed by a conditional jump.)
Two of your speculations on how multiple instructions can be executed in a single cycle are quite valid and have been used in actual processors. Your first speculation, that a faster internal clock is used, was used in the original Pentium 4's "fireball" ALUs. These ALUs were clocked at twice the frequency of the rest of the core, which was already relatively high.
(This was accomplished by using a staggered ALU in which the lower half of an addition was done in one cycle, allowing a dependent operation to use the lower half of the result in the next cycle. For operations like add, xor, or left shift which only need the lower half of operands to produce the full lower half of the result, such staggering—also known as width-pipelining—allows single cycle result latency as well as single cycle throughput.)
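A rough sketch of the staggering, with a 32-bit add split into two 16-bit halves (the struct and function names below are invented purely for illustration):

```c
#include <stdint.h>

/* Width-pipelined 32-bit add, split into two 16-bit halves.
   "Cycle 1": add the low halves and produce a carry.
   "Cycle 2": add the high halves plus that carry. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
    unsigned carry;   /* carry out of the low half */
} staggered_t;

staggered_t add_low_half(uint32_t a, uint32_t b) {
    uint32_t lo_sum = (a & 0xFFFFu) + (b & 0xFFFFu);
    staggered_t r = { (uint16_t)lo_sum, 0, (lo_sum >> 16) & 1u };
    return r;                      /* available after "cycle 1" */
}

staggered_t add_high_half(staggered_t r, uint32_t a, uint32_t b) {
    r.hi = (uint16_t)((a >> 16) + (b >> 16) + r.carry);
    return r;                      /* available after "cycle 2" */
}
```

A dependent operation that only needs the low half of the result can start as soon as `add_low_half` has produced its part, which is what gives the single-cycle effective latency.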
A somewhat related technique, cascaded ALUs, was used by the HyperSPARC. The HyperSPARC fed the results from two ALUs into a third ALU. This allowed two independent and a third dependent operation to be executed in a single cycle.
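As a small illustration (plain C, not HyperSPARC code), an expression with two independent sub-results feeding one dependent operation maps directly onto that arrangement:

```c
/* Two independent adds feed one dependent add.  With cascaded ALUs,
   one ALU computes a+b, a second computes c+d, and a third consumes
   both results within the same cycle. */
int sum4(int a, int b, int c, int d) {
    int t0 = a + b;   /* independent */
    int t1 = c + d;   /* independent */
    return t0 + t1;   /* depends on both t0 and t1 */
}
```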
Your speculation that "there are multiple concurrent pipelines per core" is the other technique that has been used. This type of design is called superscalar and is by far the most common means of increasing the number of operations executed in a single cycle.
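The benefit shows up directly in code whose operations are independent of each other. In the (deliberately simplified) loop below, the four accumulators form separate dependency chains, so a sufficiently wide superscalar core can issue several of the adds in the same cycle; a single running sum would force them to execute one after another:

```c
/* Four independent accumulators: several adds can issue per cycle on a
   superscalar core, because no accumulator waits on another.  A single
   accumulator would serialize the adds, since each one needs the
   previous result.  (The last n % 4 elements are omitted for brevity.) */
long sum_independent(const long *a, long n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```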
There are also a few other odds and ends of instruction execution that might be worth noting. Some operations can be more efficiently performed outside of the ordinary execution units. The technique of move elimination exploits the use of register renaming in out-of-order processors to perform move operations during register renaming; the move simply copies the physical register number from one position in the renaming table (called a register alias table) to another. Not only does this effectively increase execution width but it also removes a dependency. This technique was used early with the stack-based x87, but is now broadly used in Intel's high performance x86 processors. (The use of destructive, two-operand instructions in x86 makes move elimination more helpful than it would be in a typical RISC.)
A technique similar to move elimination is the handling of register zeroing instructions during renaming. By providing a register name that supplies the zero value, a register-clearing instruction (like an xor or subtract with both operands being the same register) can simply insert that name into the renaming table (RAT).
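A toy model of the renaming step makes both tricks concrete. Everything below (the table size, the `PHYS_ZERO` entry, the function names) is invented for illustration; a real renamer also has to track free physical registers and reference counts:

```c
#include <stdint.h>

#define NUM_ARCH_REGS 16
#define PHYS_ZERO     255   /* physical register hard-wired to zero */

/* Register alias table: architectural register -> physical register. */
static uint8_t rat[NUM_ARCH_REGS];

/* Move elimination: "mov dst, src" is handled entirely in the renamer
   by copying the physical register number; no execution unit is used. */
void rename_move(int dst, int src) {
    rat[dst] = rat[src];
}

/* Zero idiom: "xor dst, dst" (or "sub dst, dst") just points the
   destination at the always-zero physical register. */
void rename_zero(int dst) {
    rat[dst] = PHYS_ZERO;
}
```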
Another technique used by some x86 processors reduces the cost of push and pop operations. Ordinarily an instruction using the stack pointer would have to wait a full cycle for a previous push or pop to update the value of the stack pointer. By recognizing that push and pop only add or subtract a small value to or from the stack pointer, one can compute the results of multiple additions/subtractions in parallel. The main delay in addition is carry propagation, but with small values the more significant bits of the base value (in this case the stack pointer) will see at most one carry in. This allows an optimization similar to that of a carry-select adder to be applied to multiple additions of small values. In addition, since the stack pointer is typically only updated by constants, these updates can be performed earlier in the pipeline, separately from the main execution engine.
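A sketch of why several small stack-pointer updates can be resolved together: the low byte absorbs all the deltas, while the high part only needs to be prepared for a carry or borrow of one, as in a carry-select adder. The 8-bit split and the function name are arbitrary choices for illustration:

```c
#include <stdint.h>

/* Apply several small push/pop deltas to a stack pointer at once.
   Valid as long as the accumulated delta stays within +/-255, so the
   high part sees at most a single carry in or borrow out. */
uint64_t apply_stack_deltas(uint64_t sp, const int8_t *deltas, int n) {
    int32_t low = (int32_t)(sp & 0xFF);
    for (int i = 0; i < n; i++)
        low += deltas[i];              /* conceptually done in parallel */

    uint64_t high_c0 = (sp >> 8);      /* precomputed: no carry         */
    uint64_t high_c1 = (sp >> 8) + 1;  /* precomputed: carry in of one  */
    uint64_t high_cm = (sp >> 8) - 1;  /* precomputed: borrow of one    */

    uint64_t high = (low >= 256) ? high_c1 : (low < 0) ? high_cm : high_c0;
    return (high << 8) | (uint64_t)(low & 0xFF);
}
```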
It is also possible to merge instructions into a single, more complex operation. While the reverse process of splitting instructions into multiple, simpler operations is an old technique, merging instructions (which Intel terms macro-op fusion) can allow the implementation to support operations more complex than those exposed in the instruction set.
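The compare followed by a conditional jump mentioned above is the classic fusion pair. A toy decoder might fuse it like this (the structs and enum are invented; a real decoder operates on x86 machine code, not on neat structs):

```c
/* Toy decode step: if a CMP is immediately followed by a conditional
   jump, emit one fused compare-and-branch micro-op instead of two. */
typedef enum { OP_CMP, OP_JCC, OP_OTHER } opcode_t;

typedef struct { opcode_t op; int src1, src2, target; } insn_t;
typedef struct { opcode_t op; int src1, src2, target; int fused; } uop_t;

int decode(const insn_t *in, int n, uop_t *out) {
    int emitted = 0;
    for (int i = 0; i < n; i++) {
        if (in[i].op == OP_CMP && i + 1 < n && in[i + 1].op == OP_JCC) {
            /* one µop carrying both the comparison and the branch */
            uop_t u = { OP_JCC, in[i].src1, in[i].src2, in[i + 1].target, 1 };
            out[emitted++] = u;
            i++;   /* two instructions consumed, one µop emitted */
        } else {
            uop_t u = { in[i].op, in[i].src1, in[i].src2, in[i].target, 0 };
            out[emitted++] = u;
        }
    }
    return emitted;
}
```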
On the theoretical side, other techniques have been proposed. Small constants other than zero could be supported in the RAT and some simple operations that use or reliably produce such small values could be handled early. ("Physical Register Inlining", Mikko H. Lipasti et al., 2004, suggested using the RAT as a means of reducing register count, but the idea could be extended to support loading small immediates and simple operations on small numbers.)
For trace caches (which store sequences of instructions under particular assumptions of control flow), there can be opportunities to merge operations separated by branches and remove operations that produce unused results in the trace. The caching of the optimizations in a trace cache can also encourage performing optimizations such as instruction merging which might not be worthwhile if they had to be done each time the instruction stream was fetched.
Value prediction can be used to increase the number of operations that can be executed in parallel by removing dependencies. A stride-based value predictor is similar to the pop/push optimization of a specialized stack engine mentioned earlier. It can compute multiple additions mostly in parallel, removing the serialization. The general idea of value prediction is that with a predicted value, dependent operations can proceed without delay. (Branch direction and target prediction is effectively just a very limited form of value prediction, allowing the fetching of following instructions which are dependent on the "value" of the branch—taken or not—and the next instruction address, another value.)
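A minimal stride predictor, indexed by instruction address, makes the idea concrete (the table size and the direct indexing are arbitrary simplifications):

```c
#include <stdint.h>

#define VP_ENTRIES 1024

typedef struct {
    uint64_t last_value;   /* last result produced by this instruction */
    int64_t  stride;       /* difference between its last two results  */
} vp_entry_t;

static vp_entry_t vp_table[VP_ENTRIES];

/* Predict the next result of the instruction at 'pc' as last + stride,
   so dependent operations can start speculatively. */
uint64_t vp_predict(uint64_t pc) {
    vp_entry_t *e = &vp_table[pc % VP_ENTRIES];
    return e->last_value + (uint64_t)e->stride;
}

/* When the real result arrives, train the predictor (a full design
   would also verify the earlier prediction and squash on a miss). */
void vp_update(uint64_t pc, uint64_t actual) {
    vp_entry_t *e = &vp_table[pc % VP_ENTRIES];
    e->stride = (int64_t)(actual - e->last_value);
    e->last_value = actual;
}
```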
Some dark magic happens on the inside of modern processors, but your thoughts are definitely along the right lines.
The key to understanding the efficiency of modern processors is realising that they are superscalar. From Wikipedia (emphasis mine):
> A superscalar CPU architecture implements a form of parallelism called *instruction-level parallelism* within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.
These modern processors have multiple execution units per core, as you guessed. Hyper-threading is also interesting to consider: some parts of the pipeline are duplicated, but some are not.
Out-of-order execution is also interesting to read about, but does not directly answer your question. It does reduce the number of "wasted" CPU cycles though.
Efficiency is also affected by many other things that can cause a stall inside the processor, including (but definitely not limited to):
- The results of previous instructions not being available.
- Cache misses.
- The code branching, which would invalidate already fetched instructions (read about branch prediction here and here).
Modern compilers try to help with many of the above items; the processor then takes over. For a good example, see this question elsewhere on Stack Exchange, which highlights an important difference between two instructions that can do the same thing (in some circumstances); however, one can be "quicker" than the other on some processors due to the execution unit in use.
For a human readable explanation of the modern CPU pipeline, see A journey through the CPU pipeline. For a somewhat more technical explanation see Agner Fog's Microarchitecture paper.
What do you think happened: did all the engineers at Intel, AMD and IBM read that a pipeline can only deliver one result per cycle and say, "oh well, that's it then, we can't make these processors any faster"? Or did they read it and say, "Can't deliver more than one result per cycle? We'll see about that!"?
For a good introduction to the Haswell architecture, for example, you can follow this link: http://www.realworldtech.com/haswell-cpu/, or you can just go to the Intel website, where you will find a bit of documentation.
Each core of the Haswell processor has a huge number of execution units, which can perform operations independently of each other, so multiple operations can be performed in parallel. Next, the Haswell processor has several execution units that handle vector operations up to 256 bits in size. A vector operation could, for example, do four double-precision floating-point operations or eight single-precision floating-point operations in one vector operation. And finally, the Haswell processor supports "fused multiply-add", which means that calculating a times b plus c is only a single operation.
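This is what it looks like from C through the AVX/FMA intrinsics; a minimal sketch, assuming a CPU with FMA support and a compiler flag such as `-mfma` (the kernel itself is just an illustrative axpy):

```c
#include <immintrin.h>

/* y[i] = a * x[i] + y[i], 8 single-precision lanes per iteration.
   Each _mm256_fmadd_ps is a single fused multiply-add instruction
   performing 8 multiplications and 8 additions. */
void saxpy(float a, const float *x, float *y, int n) {
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);   /* vy = va * vx + vy */
        _mm256_storeu_ps(y + i, vy);
    }
    /* handling of the last n % 8 elements is omitted for brevity */
}
```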
The theoretical maximum, since Haswell has two units capable of fused multiply-add, is two fused multiply-add operations per cycle, each doing eight single-precision multiplications and eight additions, for 32 single-precision floating-point operations per cycle.
The 3630 processor is not in Intel's latest price list, but there are models like the 3740QM with four cores. So instead of 32, you can get 128 floating-point operations per clock cycle. This is the theoretical maximum; the factors behind it are multiplied out in the short sketch after the list below. Achieving half of that in real life is a challenge, but not impossible for suitable tasks. There are other processors available with up to 15 cores (at prices that not even the most fanatical gamers will pay).
So you have a combination of several multipliers:
- Multiple cores per processor.
- (Hyperthreading, not mentioned before, allows you to get closer to theoretical limits)
- A fused multiply-add does two arithmetic operations while counting as only one.
- 256-bit vectors do 8 single-precision operations while counting as only one.
- Two vector execution units capable of handling fused multiply-add.
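Multiplied out, with the clock frequency left as a placeholder you can substitute for your own part, the back-of-the-envelope peak looks like this:

```c
#include <stdio.h>

int main(void) {
    /* Haswell peak per core: 2 FMA units x 8 single-precision lanes
       x 2 floating-point operations (multiply + add) per FMA. */
    int per_core_per_cycle = 2 * 8 * 2;   /* = 32 */
    int cores = 4;                        /* e.g. a quad-core part */
    double ghz = 3.0;                     /* placeholder clock frequency */

    printf("%d FLOPs/cycle per core\n", per_core_per_cycle);
    printf("%d FLOPs/cycle per chip\n", per_core_per_cycle * cores);
    printf("%.0f GFLOPS theoretical peak at %.1f GHz\n",
           per_core_per_cycle * cores * ghz, ghz);
    return 0;
}
```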
8.6 operations per cycle isn't too difficult to achieve. Even 8.6 operations per cycle per core isn't too difficult.