Using 1GB pages degrades performance
Intel was kind enough to reply to this issue. See their answer below.
This issue is due to how physical pages are actually committed. With 1GB pages, the memory is contiguous, so as soon as you write to any one byte within the 1GB page, the entire 1GB page is assigned. With 4KB pages, by contrast, physical pages get allocated as and when you touch each 4KB page for the first time.
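To see this first-touch behavior directly, here is a minimal sketch (my own illustration, not part of the benchmark) that counts minor page faults while touching a 4KB-paged buffer one page at a time:

    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <cstddef>
    #include <cstdio>

    // Minor (soft) page faults incurred by this process so far.
    static long minor_faults() {
        rusage ru {};
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main() {
        const size_t size = 1 << 20; // 1 MiB = 256 pages of 4 KiB
        char* p = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        if (p == MAP_FAILED) return 1;
        long before = minor_faults();
        for (size_t off = 0; off < size; off += 4096)
            p[off] = 1; // first touch commits one 4 KiB page at a time
        // Expect roughly 256 new minor faults here; with a single 1 GiB huge
        // page the whole buffer would be committed by one first touch instead.
        printf("minor faults: %ld\n", minor_faults() - before);
    }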
for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
    for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
        // Consecutive inner-loop iterations touch addresses that are
        // size / MESSINESS_LEVEL bytes apart.
        index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
        ptr[index] = index * 5;
    }
}
In the innermost loop, the index changes at a stride of 512KB, so consecutive references map at 512KB offsets. Caches typically have 2048 sets (which is 2^11), so with 64-byte lines, address bits 6:16 select the set. But if you stride at 512KB offsets, bits 6:16 are identical for every reference, so they all select the same set and spatial locality is lost.
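To make the aliasing concrete, here is a minimal sketch (assuming the 64-byte lines and 2048 sets mentioned above) showing that any two addresses 512KB apart resolve to the same cache set:

    #include <cstdint>

    // Set index for a 2048-set cache with 64-byte lines: address bits 6..16.
    constexpr uint64_t set_of(uint64_t addr) { return (addr >> 6) & 0x7FF; }

    // 512 KiB = 0x80000; striding by it leaves bits 6..16 unchanged,
    // so every reference lands in the same set.
    static_assert(set_of(0x00000) == set_of(0x80000), "same set at 512 KiB stride");
    static_assert(set_of(0x12340) == set_of(0x12340 + 0x80000), "same set");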
We would recommend initializing the entire 1GB buffer sequentially (in the small-page test), as below, before starting the clock to time it:
for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
    ptr[i] = i * 5;
Basically, the issue is set conflicts causing cache misses with huge pages, compared to small pages, because of the very large constant offsets. When you use constant offsets, the test is really not random.
Not an answer, but here are more details on this perplexing issue.
Performance counters show a roughly similar number of instructions, but roughly twice as many cycles spent when huge pages are used:
- 4KiB pages: IPC 0.29,
- 1GiB pages: IPC 0.10.
These IPC numbers say that the code is bottlenecked on memory access (CPU-bound IPC on Skylake is 3 and above). Huge pages bottleneck harder.
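(IPC here is just instructions divided by cycles, as reported in the perf output below: 3,268,573,978 / 11,448,252,551 ≈ 0.29 for 4KiB pages, and 2,175,972,910 / 21,911,541,450 ≈ 0.10 for 1GiB pages.)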
I modified your benchmark to use MAP_POPULATE | MAP_LOCKED | MAP_FIXED with the fixed address 0x600000000000 in both cases, to eliminate the timing variation associated with page faults and random mapping addresses. On my Skylake system, 2MiB and 1GiB pages are more than 2x slower than 4KiB pages.
Compiled with g++-8.4.0 -std=gnu++14 -pthread -m{arch,tune}=skylake -O3 -DNDEBUG:
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 2MB:64 --pool-pages-max 2MB:64
[max@supernova:~/src/test] $ sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1
[max@supernova:~/src/test] $ for s in small huge; do sudo chrt -f 40 taskset -c 7 perf stat -dd ./release/gcc/test $s random; done
Duration: 2156150
Performance counter stats for './release/gcc/test small random':
2291.190394 task-clock (msec) # 1.000 CPUs utilized
1 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
53 page-faults # 0.023 K/sec
11,448,252,551 cycles # 4.997 GHz (30.83%)
3,268,573,978 instructions # 0.29 insn per cycle (38.55%)
430,248,155 branches # 187.784 M/sec (38.55%)
758,917 branch-misses # 0.18% of all branches (38.55%)
224,593,751 L1-dcache-loads # 98.025 M/sec (38.55%)
561,979,341 L1-dcache-load-misses # 250.22% of all L1-dcache hits (38.44%)
271,067,656 LLC-loads # 118.309 M/sec (30.73%)
668,118 LLC-load-misses # 0.25% of all LL-cache hits (30.73%)
<not supported> L1-icache-loads
220,251 L1-icache-load-misses (30.73%)
286,864,314 dTLB-loads # 125.203 M/sec (30.73%)
6,314 dTLB-load-misses # 0.00% of all dTLB cache hits (30.73%)
29 iTLB-loads # 0.013 K/sec (30.73%)
6,366 iTLB-load-misses # 21951.72% of all iTLB cache hits (30.73%)
2.291300162 seconds time elapsed
Duration: 4349681
Performance counter stats for './release/gcc/test huge random':
4385.282466 task-clock (msec) # 1.000 CPUs utilized
1 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
53 page-faults # 0.012 K/sec
21,911,541,450 cycles # 4.997 GHz (30.70%)
2,175,972,910 instructions # 0.10 insn per cycle (38.45%)
274,356,392 branches # 62.563 M/sec (38.54%)
560,941 branch-misses # 0.20% of all branches (38.63%)
7,966,853 L1-dcache-loads # 1.817 M/sec (38.70%)
292,131,592 L1-dcache-load-misses # 3666.84% of all L1-dcache hits (38.65%)
27,531 LLC-loads # 0.006 M/sec (30.81%)
12,413 LLC-load-misses # 45.09% of all LL-cache hits (30.72%)
<not supported> L1-icache-loads
353,438 L1-icache-load-misses (30.65%)
7,252,590 dTLB-loads # 1.654 M/sec (30.65%)
440 dTLB-load-misses # 0.01% of all dTLB cache hits (30.65%)
274 iTLB-loads # 0.062 K/sec (30.65%)
9,577 iTLB-load-misses # 3495.26% of all iTLB cache hits (30.65%)
4.385392278 seconds time elapsed
Ran on Ubuntu 18.04.5 LTS with an Intel i9-9900KS (which is not NUMA), 4x8GiB 4GHz CL17 RAM in all 4 slots, the performance governor for no CPU frequency scaling, liquid-cooling fans on max for no thermal throttling, FIFO priority 40 for no preemption, pinned to one specific CPU core for no CPU migration, over multiple runs. The results are similar with the clang++-8.0.0 compiler.
It feels like something fishy is going on in hardware, like a store buffer per page frame, such that 4KiB pages allow ~2x more stores per unit of time.
It would be interesting to see results for AMD Zen 3 CPUs.
On an AMD Ryzen 9 5950X (Zen 3), the huge-pages version is only up to 10% slower:
Duration: 1578723
Performance counter stats for './release/gcc/test small random':
1,726.89 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1,947 page-faults # 0.001 M/sec
8,189,576,204 cycles # 4.742 GHz (33.02%)
3,174,036 stalled-cycles-frontend # 0.04% frontend cycles idle (33.14%)
95,950 stalled-cycles-backend # 0.00% backend cycles idle (33.25%)
3,301,760,473 instructions # 0.40 insn per cycle
# 0.00 stalled cycles per insn (33.37%)
480,276,481 branches # 278.116 M/sec (33.49%)
864,075 branch-misses # 0.18% of all branches (33.59%)
709,483,403 L1-dcache-loads # 410.844 M/sec (33.59%)
1,608,181,551 L1-dcache-load-misses # 226.67% of all L1-dcache accesses (33.59%)
<not supported> LLC-loads
<not supported> LLC-load-misses
78,963,441 L1-icache-loads # 45.726 M/sec (33.59%)
46,639 L1-icache-load-misses # 0.06% of all L1-icache accesses (33.51%)
301,463,437 dTLB-loads # 174.570 M/sec (33.39%)
301,698,272 dTLB-load-misses # 100.08% of all dTLB cache accesses (33.28%)
54 iTLB-loads # 0.031 K/sec (33.16%)
2,774 iTLB-load-misses # 5137.04% of all iTLB cache accesses (33.05%)
243,732,886 L1-dcache-prefetches # 141.140 M/sec (33.01%)
<not supported> L1-dcache-prefetch-misses
1.727052901 seconds time elapsed
1.579089000 seconds user
0.147914000 seconds sys
Duration: 1628512
Performance counter stats for './release/gcc/test huge random':
1,680.06 msec task-clock # 1.000 CPUs utilized
1 context-switches # 0.001 K/sec
1 cpu-migrations # 0.001 K/sec
1,947 page-faults # 0.001 M/sec
8,037,708,678 cycles # 4.784 GHz (33.34%)
4,684,831 stalled-cycles-frontend # 0.06% frontend cycles idle (33.34%)
2,445,415 stalled-cycles-backend # 0.03% backend cycles idle (33.34%)
2,217,699,442 instructions # 0.28 insn per cycle
# 0.00 stalled cycles per insn (33.34%)
281,522,918 branches # 167.567 M/sec (33.34%)
549,427 branch-misses # 0.20% of all branches (33.33%)
312,930,677 L1-dcache-loads # 186.261 M/sec (33.33%)
1,614,505,314 L1-dcache-load-misses # 515.93% of all L1-dcache accesses (33.33%)
<not supported> LLC-loads
<not supported> LLC-load-misses
888,872 L1-icache-loads # 0.529 M/sec (33.33%)
13,140 L1-icache-load-misses # 1.48% of all L1-icache accesses (33.33%)
9,168 dTLB-loads # 0.005 M/sec (33.33%)
870 dTLB-load-misses # 9.49% of all dTLB cache accesses (33.33%)
1,173 iTLB-loads # 0.698 K/sec (33.33%)
1,914 iTLB-load-misses # 163.17% of all iTLB cache accesses (33.33%)
253,307,275 L1-dcache-prefetches # 150.772 M/sec (33.33%)
<not supported> L1-dcache-prefetch-misses
1.680230802 seconds time elapsed
1.628170000 seconds user
0.052005000 seconds sys