What is locality of reference?

First of all, note that these concepts are not universal laws, they are observations about common forms of code behavior that allow CPU designers to optimize their system to perform better over most of the programs. At the same time, these are properties that programmers seek to adopt in their programs as they know that's how memory systems are built and that's what CPU designers optimize for.

Spatial locality refers to the property of some (most, actually) applications to access memory in a sequential or strided manner. This usually stems from the fact that the most basic data structure building blocks are arrays and structs, both of which store multiple elements adjacently in memory. In fact, many implementations of data structures that are semantically linked (graphs, trees, skip lists) are using arrays internally to improve performance.

Spatial locality allows a CPU to improve the memory access performance thanks to:

Memory caching mechanisms such as caches, page tables, memory controller page are already larger by design than what is needed for a single access. This means that once you pay the memory penalty for bringing data from far memory or a lower level cache, the more additional data you can consume from it the better is your utilization.
Hardware prefetching which exists on almost all CPUs today often covers spatial accesses. Everytime you fetch addr X, the prefetcher will likely fetch the next cache line, and possibly others further ahead. If the program exhibits a constant stride, most CPUs would be able to detect that as well and extrapolate to prefetch even further steps of the same stride. Modern spatial prefetchers may even predict variable recurring strides (e.g. VLDP, SPP)

Temporal locality refers to the property of memory accesses or access patterns to repeat themselves. In the most basic form this could mean that if address X was once accessed it may also be accessed in the future, but since caches already store recent data for a certain duration this form is less interesting (although there are mechanisms on some CPUs aimed to predict which lines are likely to be accessed again soon and which are not).

A more interesting form of temporal locality is that two (or more) temporally adjacent accesses observed once, may repeat together again. That is - if you once accessed address A and soon after that address B, and at some later point the CPU detects another access to address A - it may predict that you will likely access B again soon, and proceed to prefetch it in advance. Prefetchers aimed to extract and predict this type of relations (temporal prefetchers) are often using relatively large storage to record many such relations. (See Markov prefetching, and more recently ISB, STMS, Domino, etc..)

By the way, these concepts are in no way exclusive, and a program can exhibit both types of localities (as well as other, more irregular forms). Sometimes both are even grouped together under the term spatio-temporal locality to represent the "common" forms of locality, or a combined form where the temporal correlation connects spatial constructs (like address delta always following another address delta).

It is a principle which states that if some variables are referenced by a program, it is highly likely that the same might be referenced again (later in time - also known as temporal locality) .
It is also highly likely that any consecutive storage in memory might be referenced sooner (spatial locality)

enter image description here

This would not matter if your computer was filled with super-fast memory.

But unfortunately that's not the case and computer-memory looks something like this¹:

+----------+
| CPU      |  <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache|  <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 - Cache|  <<-- ~12 CPU-cycles access latency (fast)
+----+-----+
     |
+----------+
|L3 - Cache|  <<-- ~35 CPU-cycles access latency (medium)
+----+-----+       (usually shared between CPU-cores)
     |
     |   <<-- This thin wire is the memory bus, it has limited bandwidth.
+----+-----+
| main-mem |  <<-- ~100 CPU-cycles access latency (slow)
+----+-----+  <<-- The main memory is big but slow (because we are cheap-skates)
     |
     |   <<-- Even slower wire to the harddisk
+----+-----+
| harddisk | <<-- Works at 0,001% of CPU speed
+----------+

Spatial Locality
In this diagram, the closer data is to the CPU the faster the CPU can get at it.
This is related to Spacial Locality. Data has spacial locality if it is located close together in memory.
Because of the cheap-skates that we are RAM is not really Random Access, it is really Slow if random, less slow if accessed sequentially Access Memory SIRLSIAS-AM. DDR SDRAM transfers a whole burst of 32 or 64 bytes for one read or write command.
That is why it is smart to keep related data close together, so you can do a sequential read of a bunch of data and save time.

Temporal locality
Data stays in main-memory, but it cannot stay in the cache, or the cache would stop being useful. Only the most recently used data can be found in the cache; old data gets pushed out.
This is related to temporal locality. Data has strong temporal locality if it is accessed at the same time.
This is important because if item A is in the cache (good) than Item B (with strong temporal locality to A) is very likely to also be in the cache.

Footnote 1:

This is a simplification with latency cycle counts estimated from various cpus for example purposes, but give you the right order-of-magnitude idea for typical CPUs.

In reality latency and bandwidth are separate factors, with latency harder to improve for memory farther from the CPU. But HW prefetching and/or out-of-order exec can hide latency in some cases, like looping over an array. With unpredictable access patterns, effective memory throughput can be much lower than 10% of L1d cache.

For example, L2 cache bandwidth is not necessarily 3x worse than L1d bandwidth. (But it is lower if you're using AVX SIMD to do 2x 32-byte loads per clock cycle from L1d on a Haswell or Zen2 CPU.)

This simplified version also leaves out TLB effects (page-granularity locality) and DRAM-page locality. (Not the same thing as virtual memory pages). For a much deeper dive into memory hardware and tuning software for it, see What Every Programmer Should Know About Memory?

Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why a multi-level cache hierarchy is necessary to get the combination of latency/bandwidth and capacity (and hit-rate) we want.

One huge fast L1-data cache would be prohibitively power-expensive, and still not even possible with as low latency as the small fast L1d cache in modern high-performance CPUs.

In multi-core CPUs, L1i/L1d and L2 cache are typically per-core private caches, with a shared L3 cache. Different cores have to compete with each other for L3 and memory bandwidth, but each have their own L1 and L2 bandwidth. See How can cache be that fast? for a benchmark result from a dual-core 3GHz IvyBridge CPU: aggregate L1d cache read bandwidth on both cores of 186 GB/s vs. 9.6 GB/s DRAM read bandwidth with both cores active. (So memory = 10% L1d for single-core is a good bandwidth estimate for desktop CPUs of that generation, with only 128-bit SIMD load/store data paths). And L1d latency of 1.4 ns vs. DRAM latency of 72 ns

What is locality of reference?

Tags:

Memory

Caching

Cpu Architecture

Cpu Cache

Related

Recent Posts