SIMD latency and throughput

The "latency" of an instruction is how many clock cycles it takes to perform: how long until the result is ready for a dependent instruction to use as an input. If you have a loop-carried dependency chain, you can add up the latencies of the operations to find the length of the critical path.

If you have independent work in each loop iteration, out-of-order execution can overlap it. The length of each iteration's dependency chain (in latency cycles) tells you how hard out-of-order exec has to work to overlap multiple instances of that chain.


Normally "throughput" would mean the number of instructions per clock cycle, but what instruction tables actually list is reciprocal throughput: the number of clock cycles per independent instruction start. So 0.5 clock cycles means that 2 independent instructions of that kind can start in one clock cycle (each result still takes the instruction's latency to become ready).

Note that execution units are pipelined; all but the divider are fully pipelined (able to start a new instruction every clock cycle). Latency is separate from throughput (how often an independent operation can start). Many instructions are single-uop, so their reciprocal throughput is usually 1/n cycles, where n is a small integer (the number of ports with an execution unit that can run that instruction).

Intel documents that here: https://software.intel.com/en-us/articles/measuring-instruction-latency-and-throughput


To find out whether two different instructions compete with each other for the same throughput resource, you need to consult a more detailed guide. For example, https://agner.org/optimize/ has instruction tables and a microarch guide. These go into detail about execution ports, and break down instructions into the three dimensions that matter: front-end cost in uops, which back-end ports they can run on, and latency.

For example, _mm_shuffle_epi8 and _mm_cvtsi32_si128 both run on port 5 on most Intel CPUs, so they compete for the same 1/clock throughput. But _mm_add_epi32 runs on port 1 or port 5 on Haswell, so its 0.5c throughput only partially competes with shuffles.

https://uops.info/ has very detailed instruction tables from automated testing, including latency from each input separately to the output.

Agner Fog's tables are nice (compact and readable) but sometimes have typos or mistakes, and they give only a single latency number, so you don't always know which input formed the dependency chain.

See also "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?"


The following is a quote from Intel's page Measuring Instruction Latency and Throughput.

Latency and Throughput

Latency is the number of processor clocks it takes for an instruction to have its data available for use by another instruction. Therefore, an instruction which has a latency of 6 clocks will have its data available for another instruction that many clocks after it starts its execution.

Throughput is the number of processor clocks it takes for an instruction to execute or perform its calculations. An instruction with a throughput of 2 clocks would tie up its execution unit for that many cycles which prevents an instruction needing that execution unit from being executed. Only after the instruction is done with the execution unit can the next instruction enter.