Why are newer generations of processors faster at the same clock speed?
Designing a processor to deliver high performance is far more than just increasing the clock rate. There are numerous other ways to increase performance, enabled through Moore's law and instrumental to the design of modern processors.
Clock rates can't increase indefinitely.
At first glance, it may seem that a processor simply executes a stream of instructions one after another, with performance increases attained through higher clock rates. However, increasing clock rate alone isn't enough. Power consumption and heat output increase as clock rates go up.
With very high clock rates, significant increases in CPU core voltage become necessary. Because power consumption (and thus TDP) increases with the square of the core voltage (Vcore), we eventually reach a point where excessive power consumption, heat output, and cooling requirements prevent further increases in clock rate. This limit was reached in 2004, in the days of the Pentium 4 Prescott. While recent improvements in power efficiency have helped, significant increases in clock rate are no longer feasible. See: Why have CPU manufacturers stopped increasing the clock speeds of their processors?
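As a rough rule of thumb, the dynamic power of a CMOS processor scales as P ≈ C × V² × f (switched capacitance times voltage squared times clock frequency), so a higher clock rate that also demands a higher voltage drives power consumption up far faster than linearly.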
Graph of stock clock speeds in cutting-edge enthusiast PCs over the years.
Through Moore's law, the observation that the number of transistors on an integrated circuit doubles roughly every 18 to 24 months (primarily as a result of die shrinks), a variety of techniques which increase performance have been implemented. These techniques have been refined and perfected over the years, enabling more instructions to be executed in a given period of time. They are discussed below.
Seemingly sequential instruction streams can often be parallelized.
Although a program may simply consist of a series of instructions to execute one after another, these instructions, or parts thereof, can very often be executed simultaneously. This is called instruction-level parallelism (ILP). Exploiting ILP is vital to attaining high performance, and modern processors use numerous techniques to do so.
Pipelining breaks instructions into smaller pieces which can be executed in parallel.
Each instruction can be broken down into a sequence of steps, each of which is executed by a separate part of the processor. Instruction pipelining allows multiple instructions to proceed through these steps back to back, without the processor having to wait for each instruction to finish completely before starting the next. Pipelining also enables higher clock rates: because only one step of each instruction has to complete in each clock cycle, each cycle can be shorter than it would be if entire instructions had to complete one at a time.
The classic RISC pipeline contains five stages: instruction fetch, instruction decode, instruction execution, memory access, and writeback. Modern processors break execution down into many more steps, producing a deeper pipeline with more stages (and increasing attainable clock rates, as each stage is smaller and takes less time to complete), but this model provides a basic understanding of how pipelining works.
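To see the benefit in numbers: in an ideal five-stage pipeline, N instructions finish in roughly 5 + (N - 1) cycles instead of 5 × N, so once the pipeline is full the processor retires close to one instruction per cycle.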
However, pipelining can introduce hazards which must be resolved to ensure correct program execution.
Because different parts of each instruction are being executed at the same time, it is possible for conflicts to occur which interfere with correct execution. These are called hazards. There are three types of hazards: data, structural, and control.
Data hazards occur when instructions read and modify the same data at the same time or in the wrong order, potentially leading to incorrect results. Structural hazards occur when multiple instructions need to use a particular part of the processor at the same time. Control hazards occur when a conditional branch instruction is encountered.
These hazards may be resolved in various ways. The simplest solution is to stall the pipeline, temporarily putting execution of one or more instructions on hold to ensure correct results. This is avoided whenever possible because it reduces performance. For data hazards, techniques such as operand forwarding are used to reduce stalls. Control hazards are handled through branch prediction, which requires special treatment and is covered in the next section.
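For instance, a read-after-write dependency is the most common kind of data hazard. A minimal sketch in C terms (the variables and function name are purely illustrative):

```c
/* Illustrative only: a read-after-write (RAW) data hazard expressed in C.
   The multiply cannot begin until the addition has produced x; operand
   forwarding passes the result straight from the ALU to the dependent
   instruction instead of waiting for it to be written back. */
int hazard_example(int a, int b, int c, int d) {
    int x = a + b;   /* produces x                                      */
    int y = x * 2;   /* consumes x -> RAW dependency on the line above  */
    int z = c + d;   /* independent of x -> need not wait               */
    return y + z;
}
```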
Branch prediction is used to resolve control hazards which can disrupt the entire pipeline.
Control hazards, which occur when a conditional branch is encountered, are particularly serious. Branches introduce the possibility that execution will continue elsewhere in the program rather than simply the next instruction in the instruction stream, based on whether a particular condition is true or false.
Because the next instruction to execute cannot be determined until the branch condition is evaluated, it is not possible to insert any instructions into the pipeline after a branch in the absence of branch prediction. The pipeline must therefore be emptied (flushed), which can waste nearly as many clock cycles as there are stages in the pipeline. Branches tend to occur very often in programs, so control hazards can severely impact processor performance.
Branch prediction addresses this issue by guessing whether a branch will be taken. The simplest approach is to assume that branches are always taken or never taken. However, modern processors use much more sophisticated techniques for higher prediction accuracy. In essence, the processor keeps track of previous branches and uses this information in any of several ways to predict the next instruction to execute. The pipeline can then be fed with instructions from the correct location based on the prediction.
Of course, if the prediction is wrong, whatever instructions were put through the pipeline after the branch must be dropped, thereby flushing the pipeline, and any changes made by these instructions must be reverted. As a result, the accuracy of the branch predictor becomes increasingly critical as pipelines get longer and longer. Specific branch prediction techniques are beyond the scope of this answer.
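As an illustration (not part of the original answer), the cost of misprediction is easy to observe in a loop whose branch outcome depends on the data. On real hardware, this loop typically runs much faster over sorted input than over random input, even though the amount of work is identical, simply because the predictor guesses the sorted case correctly almost every time:

```c
#include <stddef.h>

/* The `if` below compiles to a conditional branch. With sorted input the
   outcomes form long runs and are predicted almost perfectly; with random
   input the branch is mispredicted roughly half the time, and each
   misprediction flushes the pipeline. */
long sum_above_threshold(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)
            sum += data[i];
    }
    return sum;
}
```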
Caches are used to speed up memory accesses.
Modern processors can execute instructions and process data far faster than they can be accessed in main memory. When the processor must access RAM, execution can stall for long periods of time until the data is available. To mitigate this effect, small high-speed memory areas called caches are included on the processor.
Because of the limited space available on the processor die, caches are of very limited size. To make the most of this limited capacity, caches store only the most recently or frequently accessed data (temporal locality). Because memory accesses tend to be clustered within particular areas (spatial locality), blocks of data near recently accessed locations are also stored in the cache. See: Locality of reference
Caches are also organized in multiple levels of varying size to optimize performance, as larger caches tend to be slower than smaller ones. For example, a processor may have a level 1 (L1) cache that is only 32 KB in size, while its level 3 (L3) cache can be several megabytes large. The size of a cache, as well as its associativity (which affects how the processor chooses which data to replace when the cache is full), significantly impacts the performance gains obtained from it.
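A rough illustration of spatial locality (the matrix size here is arbitrary): traversing a large array in memory order uses every byte of each cache line that is fetched, while striding across it wastes most of each line and misses far more often:

```c
#define N 2048
static float m[N][N];   /* 16 MB, far larger than typical L1/L2 caches */

/* Row-major traversal touches consecutive addresses, so each cache line
   brought in from memory is fully used before the next one is needed. */
float sum_rows(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal jumps N * sizeof(float) bytes between accesses,
   so on a large matrix almost every access misses the cache and stalls. */
float sum_cols(void) {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```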
Out-of-order execution reduces stalls due to hazards by allowing independent instructions to execute first.
Not every instruction in an instruction stream depends on the one before it. For example, although `a + b = c` must be executed before `c + d = e`, `a + b = c` and `f + g = h` are independent and can be executed at the same time. Furthermore, when an instruction in progress is waiting on a memory access or other delay, subsequent instructions aren't necessarily dependent on that delayed operation.

Out-of-order execution takes advantage of this fact to allow other, independent instructions to execute while one instruction is stalled. Instead of requiring instructions to execute one after another in lockstep, scheduling hardware is added so that independent instructions can be executed in any order. Instructions are dispatched to an instruction queue and issued to the appropriate part of the processor when the required data becomes available. That way, instructions that are stuck waiting for data from memory or from an earlier instruction do not hold up later, independent instructions.
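A minimal sketch of the idea, in C terms (the structure and function names are illustrative):

```c
/* While the load of p->value is outstanding (a cache miss can take hundreds
   of cycles), an out-of-order core can already execute the independent
   multiply-add below it; a strictly in-order core would simply stall. */
struct node { int value; };

int overlap_example(const struct node *p, int a, int b, int c) {
    int loaded = p->value;   /* long-latency load if it misses the cache  */
    int other  = a * b + c;  /* independent work, can execute meanwhile   */
    return loaded + other;   /* joins once the load result is available   */
}
```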
Several new and expanded data structures are required to perform out-of-order execution. The aforementioned instruction queue, called the reservation station, holds instructions until the data required for their execution becomes available. The re-order buffer (ROB) keeps track of the state of instructions in progress, in the order in which they were received, so that instructions are completed (retired) in the correct order. A register file which extends beyond the number of registers provided by the architecture itself is needed for register renaming, which prevents otherwise independent instructions from becoming dependent merely because they must share the limited set of registers provided by the architecture.
Superscalar architectures allow multiple instructions within an instruction stream to execute at the same time.
The techniques discussed above only increase the performance of the instruction pipeline. These techniques alone do not allow more than one instruction to be completed per clock cycle. However, it is often possible to execute individual instructions within an instruction stream in parallel, such as when they do not depend on each other (as discussed in the out-of-order execution section above).
Superscalar architectures take advantage of this instruction-level parallelism by allowing instructions to be sent to multiple functional units at once. The processor may have multiple functional units of a particular type (such as integer ALUs) and/or different types of functional units (such as floating-point and integer units) to which instructions may be concurrently sent.
In a superscalar processor, instructions are scheduled as in an out-of-order design, but there are now multiple issue ports, allowing different instructions to be issued and executed at the same time. Expanded instruction decoding circuitry allows the processor to read several instructions per clock cycle and determine the relationships among them. A modern high-performance processor can schedule up to eight instructions per clock cycle, depending on what each instruction does. This is how processors can complete multiple instructions per clock cycle. See: Haswell execution engine on AnandTech
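One common way software exposes this parallelism (a sketch, not from the original answer) is to break a single dependency chain into several independent ones, so that a superscalar core can keep more than one ALU busy per cycle:

```c
#include <stddef.h>

/* With a single accumulator, every addition depends on the previous one and
   the loop runs at roughly one add per cycle at best. With four independent
   accumulators, a wide core can issue several of these adds in parallel. */
long sum_four_accumulators(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* handle any leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```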
However, superscalar architectures are very difficult to design and optimize. Checking for dependencies among instructions requires very complex logic whose size can scale exponentially as the number of simultaneous instructions increases. Also, depending on the application, there is only a limited number of instructions within each instruction stream that can be executed at the same time, so efforts to take greater advantage of ILP suffer from diminishing returns.
More advanced instructions are added which perform complex operations in less time.
As transistor budgets increase, it becomes possible to implement more advanced instructions that allow complex operations to be performed in a fraction of the time they would otherwise take. Examples include vector instruction sets such as SSE and AVX, which perform computations on multiple pieces of data at the same time, and the AES instruction set, which accelerates data encryption and decryption.
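As an illustration of what such instructions look like in practice (this sketch assumes a CPU and compiler with AVX support, and an array length that is a multiple of 8):

```c
#include <immintrin.h>   /* AVX intrinsics */
#include <stddef.h>

/* One AVX instruction adds eight single-precision floats at a time,
   replacing eight separate scalar additions per loop iteration. */
void add_arrays_avx(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&out[i], _mm256_add_ps(va, vb));
    }
}
```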
To perform these complex operations, modern processors decode instructions into micro-operations (μops). Complex instructions are broken into sequences of μops, which are stored inside a dedicated buffer and scheduled for execution individually (to the extent allowed by data dependencies). This gives the processor more room to exploit ILP. To further enhance performance, a special μop cache can be used to store recently decoded μops, so that the μops for recently executed instructions can be looked up quickly.
However, the addition of these instructions does not automatically boost performance. New instructions can increase performance only if an application is written to use them. Adoption of these instructions is hampered by the fact that applications using them will not work on older processors which do not support them.
So how do these techniques improve processor performance over time?
Pipelines have become longer over the years, reducing the amount of time needed to complete each stage and therefore enabling higher clock rates. However, among other things, longer pipelines increase the penalty for an incorrect branch prediction, so a pipeline can't be too long. In trying to reach very high clock speeds, the Pentium 4 processor used very long pipelines, up to 31 stages in Prescott. To reduce the resulting performance penalties, the processor would issue instructions speculatively even when their input data might not yet be ready, and replay them until they succeeded. This led to very high power consumption and reduced the performance gained from hyper-threading. Newer processors no longer use pipelines this long, especially since clock rate scaling has reached a wall; Haswell uses a pipeline which varies between 14 and 19 stages, and lower-power architectures use shorter pipelines (Intel Atom Silvermont has 12 to 14 stages).
The accuracy of branch prediction has improved with more advanced architectures, reducing the frequency of pipeline flushes caused by misprediction and allowing more instructions to be executed concurrently. Considering the length of pipelines in today's processors, this is critical to maintaining high performance.
With increasing transistor budgets, larger and more effective caches can be embedded in the processor, reducing stalls due to memory access. Memory accesses can require more than 200 cycles to complete on modern systems, so it is important to reduce the need to access main memory as much as possible.
Newer processors are better able to take advantage of ILP through more advanced superscalar execution logic and "wider" designs that allow more instructions to be decoded and executed concurrently. The Haswell architecture can decode four instructions and dispatch eight micro-operations per clock cycle. Increasing transistor budgets allow more functional units such as integer ALUs to be included in the processor core. Key data structures used in out-of-order and superscalar execution, such as the reservation station, reorder buffer, and register file, are expanded in newer designs, which allows the processor to search a wider window of instructions for ILP to exploit. This is a major driving force behind performance increases in today's processors.
More complex instructions are included in newer processors, and an increasing number of applications use these instructions to enhance performance. Advances in compiler technology, including improvements in instruction selection and automatic vectorization, enable more effective use of these instructions.
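To illustrate the compiler side (a sketch; the flags mentioned are for gcc/clang), a plain scalar loop like the one below is a typical candidate for automatic vectorization at -O3, with -march=native allowing the compiler to emit SSE/AVX instructions without any source changes:

```c
#include <stddef.h>

/* The compiler can turn this loop into SIMD code on its own; no intrinsics
   or assembly are required from the programmer. */
void scale_array(float *x, float factor, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] *= factor;
}
```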
In addition to the above, greater integration of parts previously external to the CPU, such as the northbridge, memory controller, and PCIe lanes, reduces I/O and memory latency. This increases throughput by reducing stalls caused by delays in accessing data from other devices.
It's not because of newer instructions usually. It's just because the processor requires fewer instruction cycles to execute the same instructions. This can be for a large number of reasons:
- Large caches mean less time wasted waiting for memory.
- More execution units mean less time waiting to start operating on an instruction.
- Better branch prediction means less time wasted speculatively executing instructions that never actually need to be executed.
- Execution unit improvements mean less time waiting for instructions to complete.
- Shorter pipelines mean pipelines fill up faster.

And so on.
The absolute definitive reference is the Intel 64 and IA-32 Architectures Software Developer Manuals. They detail the changes between architectures, and they're a great resource to understand the x86 architecture.
I would recommend that you download the combined volumes 1 through 3C (first download link on that page). Volume 1 Chapter 2.2 has the information you want.
Some general differences I see listed in that chapter, going from the Core to the Nehalem/Sandy Bridge microarchitectures are:
- improved branch prediction, quicker recovery from misprediction
- HyperThreading Technology
- integrated memory controller, new cache hierarchy
- faster floating-point exception handling (Sandy Bridge only)
- LEA bandwidth improvement (Sandy Bridge only)
- AVX instruction extensions (Sandy Bridge only)
The complete list can be found in the link provided above (Vol. 1, Ch. 2.2).