Approximate cost to access various caches and main memory?

Cost to access various memories in a pretty page

  • See this page presenting the memory latency decrease from 1990 to 2020.

Summary

  1. Values having decreased but are stabilized since 2005

            1 ns        L1 cache
            3 ns        Branch mispredict
            4 ns        L2 cache
           17 ns        Mutex lock/unlock
          100 ns        Main memory (RAM)
        2 000 ns (2µs)  1KB Zippy-compress
    
  2. Still some improvements, prediction for 2020

       16 000 ns (16µs) SSD random read (olibre's note: should be less)
      500 000 ns (½ms)  Round trip in datacenter
    2 000 000 ns (2ms)  HDD random read (seek)
    

See also other sources

  • What every programmer should know about memory from Ulrich Drepper (2007)
    Old but still an excellent deep explanation about memory hardware and software interaction.
    • Full PDF (114 pages)
      • Comments on LWN about PDF version
      • Another ones
    • Seven posts on LWN + Comments
      • Part 1 - Introduction
      • Part 2 - Cache
      • Part 3 - Virtual Memory
      • Part 4 - NUMA support
      • Part 5 - What programmers can do
      • Part 6 - More things programmers can do
      • Part 7 - Memory performance tools
  • Post The Infinite Space Between Words in codinghorror.com based on book Systems Performance: Enterprise and the Cloud
  • Click to each processor listed on http://www.7-cpu.com/ to see the L1/L2/L3/RAM/... latencies (e.g. Haswell i7-4770 has L1=1ns, L2=3ns, L3=10ns, RAM=67ns, BranchMisprediction=4ns)
  • http://idarkside.org/posts/numbers-you-should-know/

See also

For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) from Gerhard Wellein, Hannes Hofmann and Dietmar Fey at University Erlangen-Nürnberg.

French speaking people may appreciate an article by SpaceFox comparing a processor with a developer both waiting for information required to continue to work.


Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

From: Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine

a visual comparison


Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress, this has what you need and more (for example, check page 22 for some timings & cycles for example).

Additionally, this page has some details on clock cycles etc. The second link served the following numbers:

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

EDIT2:
The most important is the notice under the cited table, saying:

"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS, NUMBERS OF DIMMS, ETC,ETC..YOUR MILEAGE MAY VARY."

EDIT: I should highlight that, as well as timing/cycle information, the above intel document addresses much more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).