Understanding TLB from CPUID results on Intel
How can I query the number of TLB cache levels, in case some x86 vendor decides to provide 3 levels of TLB?
Leaf 0x2 may return TLB information only on Intel processors; it's reserved on all current AMD processors. On all current Intel processors, there is no single number that tells you the number of TLB levels. The only way to determine the number of levels is by enumerating all the TLB-related cpuid leafs or subleafs. The following algorithm works on all current Intel processors that support the cpuid instruction (up to and including Ice Lake, Goldmont Plus, and Knights Mill); a minimal code sketch follows the list:
- Check whether the value 0xFE exists in any of the four registers EAX, EBX, ECX, and EDX returned when cpuid is executed with EAX set to leaf 0x2.
- If 0xFE doesn't exist, enumerate all the bytes in the four registers. Based on Table 3-12 of the Intel manual Volume 2 (number 325383-070US), there will be either one or two descriptors of data TLBs that can cache 4KB translations. The Intel manual uses the following names for TLBs that may cache data access translations: Data TLB, Data TLB0, Data TLB1, DTLB, uTLB, and Shared 2nd-Level TLB. If there are two such descriptors, the number of levels is two, and the descriptor with the larger number of TLB entries is the one for the second-level TLB. If there is only one such descriptor, the number of levels is one.
- If 0xFE exists, the TLB information needs to be obtained from cpuid leaf 0x18. Enumerate all the valid subleafs up to the maximum valid subleaf number. If there is at least one subleaf with the two least significant bits of EDX equal to 11 (binary), then the number of TLB levels is two. Otherwise, the number of TLB levels is one.
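Here is a minimal sketch of that procedure in C, using GCC/Clang's <cpuid.h>. The 4KB-data-TLB descriptor list is partial (it only covers descriptors that appear later in this answer), so treat it as an illustration rather than a complete implementation; the rules that AL (the low byte of EAX) is not a descriptor and that a register with bit 31 set carries no descriptors come from the SDM's description of leaf 0x2.

```c
/* Minimal sketch (GCC/Clang <cpuid.h>) of the level-counting algorithm above.
 * The 4KB-data-TLB descriptor list is partial; build the full list from
 * Table 3-12 of the SDM. */
#include <cpuid.h>
#include <string.h>

/* Partial: leaf-0x2 descriptors for data TLBs that can cache 4KB pages
 * (only the descriptors mentioned in this answer). */
static const unsigned char k4k_data_tlb_desc[] = {
    0x03, 0x57, 0x5B, 0x64, 0x6A, 0x6B, 0xA0, 0xB3, 0xB4, 0xC1
};

/* Count the descriptor bytes of a leaf-0x2 dump for which match() is true. */
static int count_descriptor_bytes(const unsigned regs[4],
                                  int (*match)(unsigned char))
{
    int count = 0;
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)          /* bit 31 set: no descriptors here */
            continue;
        for (int b = 0; b < 4; b++) {
            unsigned char desc = (regs[r] >> (8 * b)) & 0xFF;
            if (r == 0 && b == 0)           /* AL = 0x01 is not a descriptor */
                continue;
            if (desc && match(desc))
                count++;
        }
    }
    return count;
}

static int is_fe(unsigned char d) { return d == 0xFE; }

static int is_4k_data_tlb(unsigned char d)
{
    return memchr(k4k_data_tlb_desc, d, sizeof(k4k_data_tlb_desc)) != NULL;
}

/* Returns 1 or 2 (number of TLB levels), or -1 if leaf 0x2 is unsupported. */
int intel_tlb_levels(void)
{
    unsigned regs[4];
    if (!__get_cpuid(0x2, &regs[0], &regs[1], &regs[2], &regs[3]))
        return -1;

    if (count_descriptor_bytes(regs, is_fe) == 0)
        /* No 0xFE: the TLB info is in leaf 0x2 itself. */
        return count_descriptor_bytes(regs, is_4k_data_tlb) >= 2 ? 2 : 1;

    /* 0xFE present: the TLB info is in leaf 0x18. */
    unsigned a, b, c, d;
    __get_cpuid_count(0x18, 0, &a, &b, &c, &d);
    unsigned max_subleaf = a;               /* EAX of subleaf 0 = max subleaf */
    int levels = 1;
    for (unsigned sl = 1; sl <= max_subleaf; sl++) {
        __get_cpuid_count(0x18, sl, &a, &b, &c, &d);
        if ((d & 0x1F) == 0)                /* invalid subleaf: no TLB described */
            continue;
        if ((d & 0x3) == 0x3)               /* two LSBs of EDX == 11b */
            levels = 2;
    }
    return levels;
}
```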
The TLB information for Ice Lake and Goldmont Plus processors is present in leaf 0x18. This leaf provides more flexibility in encoding TLB information. The TLB information for all other current Intel processors is present in leaf 0x2. I don't know about Knights Mill (if someone has access to a Knights Mill, please consider sharing the cpuid dump).
Determining the number of TLB levels is not sufficient to fully describe how the levels are related to each other. Current Intel processors implement two different 2-level TLB hierarchies:
- The second-level TLB can cache translations for data loads (including prefetches), data stores, and instruction fetches. In this case, the second-level TLB is called the "Shared 2nd-Level TLB."
- The second-level TLB can cache translations for data loads and stores, but not instruction fetches. In this case, the second-level TLB is called any of the following: Data TLB, Data TLB1, or DTLB.
I'll discuss a couple of examples based on the cpuid dumps from InstLatx64. On one of the Haswell processors with hyperthreading enabled, leaf 0x2 provides the following information in the four registers:
76036301-00F0B5FF-00000000-00C10000
There is no 0xFE, so the TLB information is present in this leaf itself. According to Table 3-12:
76: Instruction TLB: 2M/4M pages, fully associative, 8 entries
03: Data TLB: 4 KByte pages, 4-way set associative, 64 entries
63: Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte pages, 4-way set associative, 4 entries
B5: Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
C1: Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries
The other bytes are not relevant to TLBs.
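For reference, here is a small sketch showing how those descriptor bytes fall out of the raw dump above. The rules that the low byte of EAX (0x01) is not a descriptor and that a register with bit 31 set holds no descriptors are from the SDM's description of leaf 0x2; note that some of the printed descriptors describe caches or prefetch behaviour rather than TLBs.

```c
/* Sketch: extract the descriptor bytes from a leaf-0x2 dump.
 * The values below are the Haswell dump shown above. */
#include <stdio.h>

int main(void)
{
    unsigned regs[4] = { 0x76036301, 0x00F0B5FF, 0x00000000, 0x00C10000 };
    regs[0] &= ~0xFFu;                       /* drop the 0x01 in AL */
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u) continue; /* bit 31 set: register is reserved */
        for (int b = 0; b < 4; b++) {
            unsigned char desc = (regs[r] >> (8 * b)) & 0xFF;
            if (desc != 0)                   /* 0x00 is the null descriptor */
                printf("descriptor 0x%02X\n", desc);
        }
    }
}
```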
There is one discrepancy compared to Table 2-17 of the Intel optimization manual (number 248966-042b). Table 2-17 says that the instruction TLB for 4KB pages has 128 entries, is 4-way associative, and is dynamically partitioned between the two hyperthreads. But the TLB dump says that it's 8-way associative and there are only 64 entries. There is actually no encoding for a 4-way ITLB with 128 entries, so I think the manual is wrong. Anyway, C1 shows that there are two TLB levels and the second level caches data and instruction translations.
On one of the Goldmont processors, leaf 0x2 provides the following information in the four registers:
6164A001-0000FFC4-00000000-00000000
Here is the interpretation of the TLB-relevant bytes:
61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
64: Data TLB: 4 KByte pages, 4-way set associative, 512 entries
A0: DTLB: 4k pages, fully associative, 32 entries
C4: DTLB: 2M/4M Byte pages, 4-way associative, 32 entries
There are two data TLBs for 4KB pages, one has 512 entries and the other has 32 entries. This means that the processor has two levels of TLBs. The second level is called "Data TLB" and so it can only cache data translations.
Table 19-4 of the optimization manual mentions that the ITLB in Goldmont supports large pages, but this information is not present in the TLB information. The data TLB information is consistent with Table 19-7 of the manual, except that the "Data TLB" and "DTLB" are called "DTLB" and "uTLB", respectively, in the manual.
On one of the Knights Landing processors, leaf 0x2 provides the following information in the four registers:
6C6B6A01-00FF616D-00000000-00000000
6C: DTLB: 2M/4M pages, 8-way set associative, 128 entries
6B: DTLB: 4 KByte pages, 8-way set associative, 256 entries
6A: uTLB: 4 KByte pages, 8-way set associative, 64 entries
61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
6D: DTLB: 1 GByte pages, fully associative, 16 entries
So there are two TLB levels. The first one consists of multiple structures for different page sizes. The TLB for 4KB pages is called uTLB and the TLBs for the other page sizes are called DTLBs. The second-level TLB is called DTLB. These numbers and names are consistent with Table 20-3 from the manual.
Silvermont processors provide the following TLB information:
61B3A001-0000FFC2-00000000-00000000
61: Instruction TLB: 4 KByte pages, fully associative, 48 entries
B3: Data TLB: 4 KByte pages, 4-way set associative, 128 entries
A0: DTLB: 4k pages, fully associative, 32 entries
C2: DTLB: 4 KByte/2 MByte pages, 4-way associative, 16 entries
This information is consistent with the manual, except for C2. I think it should say "4 MByte/2 MByte" instead of "4 KByte/2 MByte." It's probably a typo in the manual.
The Intel Penryn microarchitecture is an example where the TLB information uses the names TLB0 and TLB1 to refer to the first and second level TLBs:
05: Data TLB1: 4 MByte pages, 4-way set associative, 32 entries
B0: Instruction TLB: 4 KByte pages, 4-way set associative, 128 entries
B1: Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries
56: Data TLB0: 4 MByte pages, 4-way set associative, 16 entries
57: Data TLB0: 4 KByte pages, 4-way associative, 16 entries
B4: Data TLB1: 4 KByte pages, 4-way associative, 256 entries
Older Intel processors have single-level TLB hierarchies. For example, here is the TLB information for Prescott:
5B: Data TLB: 4 KByte and 4 MByte pages, 64 entries
50: Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries
All Intel 80386 processors and some Intel 80486 processors include a single-level TLB hierarchy but don't support the cpuid instruction. On processors earlier than the 80386, there is no paging. If you want the algorithm above to work on all Intel x86 processors, you'll have to consider these cases as well. The Intel document number 241618-025, titled "Processor Identification and the CPUID Instruction," discusses how to handle these cases in Chapter 7.
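As a rough illustration of what that entails, here is a sketch of the usual check for CPUID support itself (toggling the ID flag, bit 21 of EFLAGS) using GCC-style inline assembly. It only matters for 32-bit builds, since CPUID is architectural in 64-bit mode; treat it as an outline of the approach described in that document rather than production code.

```c
/* Sketch: a processor supports CPUID if software can toggle the ID flag
 * (bit 21) in EFLAGS. Only meaningful for 32-bit builds. */
int cpuid_supported(void)
{
#if defined(__i386__)
    unsigned long f1, f2;
    __asm__ volatile(
        "pushfl\n\t"               /* save original EFLAGS for restore */
        "pushfl\n\t"
        "popl %0\n\t"              /* f1 = EFLAGS */
        "movl %0, %1\n\t"
        "xorl $0x00200000, %1\n\t" /* flip the ID bit */
        "pushl %1\n\t"
        "popfl\n\t"                /* try to write the flipped value */
        "pushfl\n\t"
        "popl %1\n\t"              /* f2 = EFLAGS after the attempted flip */
        "popfl"                    /* restore original EFLAGS */
        : "=&r"(f1), "=&r"(f2) : : "cc");
    return ((f1 ^ f2) & 0x00200000) != 0;
#else
    return 1;                      /* CPUID always exists on x86-64 */
#endif
}
```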
I'll discuss an example where the TLB information is present in leaf 0x18 rather than leaf 0x2. As I said earlier, the only existing Intel processors that have the TLB information in leaf 0x18 are Ice Lake and Goldmont Plus processors (and maybe Knights Mill). The leaf 0x2 dump for an Ice Lake processor is:
00FEFF01-000000F0-00000000-00000000
There is an 0xFE byte, so the TLB information is present in the more powerful leaf 0x18. Subleaf 0x0 of leaf 0x18 specifies that the maximum valid subleaf is 0x7. Here are the dumps for subleafs 0x0 to 0x7:
00000007-00000000-00000000-00000000 [SL 00]
00000000-00080007-00000001-00004122 [SL 01]
00000000-0010000F-00000001-00004125 [SL 02]
00000000-00040001-00000010-00004024 [SL 03]
00000000-00040006-00000008-00004024 [SL 04]
00000000-00080008-00000001-00004124 [SL 05]
00000000-00080007-00000080-00004043 [SL 06]
00000000-00080009-00000080-00004043 [SL 07]
The Intel manual describes how to decode these bits. Each valid subleaf describes a single TLB structure. A subleaf is valid (i.e., describes a TLB structure) if the least significant five bits of EDX are not all zeros. Hence, subleaf 0x0 is invalid. The next seven subleafs are all valid, which means that there are 7 TLB descriptors in an Ice Lake processor. The least significant five bits of EDX specify the type of the TLB and the next three bits specify the level of the TLB. The following information is obtained by decoding the subleaf bits:
- [SL 01]: Describes a first-level instruction TLB that is an 8-way fully associative cache capable of caching translations for 4KB, 2MB, and 4MB pages.
- [SL 02]: The least significant five bits represent the number 5, which is a reserved encoding according to the most recent version of the manual (Volume 2). The other bits specify a TLB that is 16-way fully associative and capable of caching translations for all page sizes. Intel has provided information on the TLBs in Ice Lake in Table 2-5 of the optimization manual. The closest match shows that the reserved encoding 5 most likely represents a first-level TLB for data store translations.
- [SL 03]: The least significant five bits represent the number 4, which is also a reserved encoding according to the most recent version of the manual. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 4KB translations. The number of ways and sets matches Table 2-5.
- [SL 04]: Similar to subleaf 0x3. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 2MB and 4MB translations. The number of ways and sets matches Table 2-5.
- [SL 05]: Similar to subleaf 0x3. The closest match with Table 2-5 suggests that it represents a first-level TLB for data loads that can cache 1GB translations. The number of ways and sets matches Table 2-5.
- [SL 06]: Describes a second-level unified TLB consisting of 8 ways and 128 sets and capable of caching translations for 4KB, 2MB, and 4MB pages.
- [SL 07]: Describes a second-level unified TLB consisting of 8 ways and 128 sets and capable of caching translations for 4KB and 1GB pages.
Table 2-5 actually mentions that there is only one unified TLB structure, but half of the ways can only cache translations for 4KB, 2MB, and 4MB pages and the other half can only cache translations for 4KB and 1GB pages. So the TLB information for the second-level TLB is consistent with the manual. However, the TLB information for the instruction TLB is not consistent with Table 2-5. The manual is probably correct. The ITLB for 4KB pages seems to be mixed up with that for 2MB and 4MB pages in the TLB information dump.
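To make the decoding concrete, here is a small sketch that decodes one leaf 0x18 subleaf using the field layout described above (EDX[4:0] = type, EDX[7:5] = level, EDX[8] = fully associative, EBX[3:0] = supported page sizes, EBX[31:16] = ways, ECX = number of sets). The exact bit positions are from my reading of the SDM, so double-check them against the manual.

```c
/* Sketch: decode one leaf-0x18 subleaf into the fields used above. */
#include <stdio.h>

static void decode_leaf18_subleaf(unsigned ebx, unsigned ecx, unsigned edx)
{
    unsigned type = edx & 0x1F;         /* 1 = data, 2 = instruction, 3 = unified */
    if (type == 0) {
        puts("invalid subleaf (no TLB described)");
        return;
    }
    unsigned level = (edx >> 5) & 0x7;  /* TLB level (1-based) */
    unsigned fully = (edx >> 8) & 0x1;  /* fully associative flag */
    unsigned ways  = ebx >> 16;
    unsigned sets  = ecx;
    printf("type=%u level=%u %s %u ways x %u sets, pages:%s%s%s%s\n",
           type, level, fully ? "fully-associative" : "set-associative",
           ways, sets,
           (ebx & 1) ? " 4K" : "", (ebx & 2) ? " 2M" : "",
           (ebx & 4) ? " 4M" : "", (ebx & 8) ? " 1G" : "");
}

int main(void)
{
    /* [SL 06] from the Ice Lake dump above: second-level unified TLB,
     * 8 ways x 128 sets, for 4KB, 2MB, and 4MB pages. */
    decode_leaf18_subleaf(0x00080007, 0x00000080, 0x00004043);
}
```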
On AMD processors, the TLB information for the first-level and second-level TLBs is provided in leafs 8000_0005 and 8000_0006, respectively. More information can be found in the AMD manual Volume 3. AMD processors earlier than the K5 don't support the cpuid instruction, and some of these processors include a single-level TLB. So if you care about these processors, you need an alternative mechanism to determine whether a TLB exists. Zen 2 adds 1GB support at both TLB levels. Information on these TLBs can be found in leaf 8000_0019.
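As a sketch of what reading those AMD leafs looks like, here is a small example that pulls out the 4KB-page TLB entry counts. The field positions are from my reading of the AMD manual Volume 3 for Fn8000_0005/Fn8000_0006, so verify them against the manual before relying on this.

```c
/* Sketch (GCC/Clang <cpuid.h>): 4KB-page TLB entry counts from AMD leafs
 * 8000_0005 (L1) and 8000_0006 (L2). */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned a, b, c, d;

    if (__get_cpuid(0x80000005, &a, &b, &c, &d)) {
        /* EBX describes the L1 TLBs for 4KB pages:
         * [7:0] ITLB entries, [15:8] ITLB assoc,
         * [23:16] DTLB entries, [31:24] DTLB assoc. */
        printf("L1 ITLB (4K): %u entries, L1 DTLB (4K): %u entries\n",
               b & 0xFF, (b >> 16) & 0xFF);
    }
    if (__get_cpuid(0x80000006, &a, &b, &c, &d)) {
        /* EBX describes the L2 TLBs for 4KB pages:
         * [11:0] ITLB entries, [15:12] ITLB assoc (encoded),
         * [27:16] DTLB entries, [31:28] DTLB assoc (encoded). */
        printf("L2 ITLB (4K): %u entries, L2 DTLB (4K): %u entries\n",
               b & 0xFFF, (b >> 16) & 0xFFF);
    }
}
```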
AMD Zen has a three-level instruction TLB hierarchy according to AMD. This is the first core microarchitecture that I know of that uses a three-level TLB hierarchy. Most probably this is also the case on AMD Zen+ and AMD Zen 2 (but I couldn't find an AMD source that confirms this). There appears to be no documented cpuid information on the L0 ITLB. So you'll probably have to check whether the processor is AMD Zen or later and provide the L0 ITLB information (8 entries for all page sizes, probably fully associative) manually for these processors.
Is "4-way associative" here just a typo meaning that "4-way set associative"?
It's not a typo. These terms are synonyms and both are commonly used.
Does DTLB stand for Data TLB? What does uTLB mean? uops-TLB? Which TLB cache level is considered here?
DTLB and uTLB are both names for data TLBs. The DTLB name is used for both the first-level and second-level TLBs. The uTLB name is only used for the first-level data TLB and is short for micro-TLB.
Does this mean that in that case the 2nd-level TLB is shared among all cores? So when not specified explicitly, is the TLB cache core-private?
The term "shared" here means "unified" as in both data and instruction translations can be cached. Intel should have called it UTLB (capital U) or Unified TLB, which is the name used in the modern leaf 0x18.
Collecting my comments into an answer. Hadi's answer more directly answers more of the question, but this is hopefully useful background about TLBs to help you understand why it's designed that way and what it means.
You can look up known microarchitecture details to help check your interpretation of cpuid results. For example, https://www.7-cpu.com/cpu/Skylake.html and https://www.realworldtech.com/haswell-cpu/5/ have details about those Intel uarches. Other sources include Intel's optimization manual, and maybe Agner Fog's microarch guide. IDK why some say "set" associative and others don't; that's not significant AFAIK.
(And in some cases apply common-sense reasoning about what would be a sane design. Surprising results might be correct but need more checking.)
Does it mean that there are only 2 levels of TLB?
Yes, mainstream x86 CPUs still "only" use 2 level TLBs, with the 2nd level being unified (instruction/data translations).
First level being split L1iTLB (tightly coupled to the front-end fetch stage) and L1dTLB (tightly coupled to load/store units). Second level TLB being unified.
On current Intel CPUs, the L2TLB is basically a victim cache; a page walker result is only added to the L1 TLB that needed it, only moving to L2TLB after eviction from L1iTLB or L1dTLB. I forget if they're exclusive (i.e. exchange entries to make sure there's no duplication), but I don't think so. Anyway, fun fact: keeping code and data in the same page can still trigger a separate page walk for code and for data because the iTLB miss for code won't put the result anywhere that can be seen by the dTLB miss, not right away. At least the page-table data itself will be in L1d cache where the page walker can get at it quickly, if the accesses are close together in time.
Does this mean that in that case the 2nd-level TLB is shared among all cores? So when not specified explicitly, is the TLB cache core-private?
TLBs are always per-core private, and there are major problems in designing a way to share entries even if you wanted to.
Unlike memory contents, translations and invlpg invalidations are per-core private. Each logical core has its own CR3 pointer to a top-level page directory. Sometimes multiple cores are running threads of the same process so they have the same CR3, but sometimes not. A shared TLB across cores would be of limited value unless the x86 ISA systems-programming details were extended with the concept of PTEs that were global across cores, not just across CR3 changes on one core. (Those across-CR3-change entries are intended for kernels that keep kernel virtual address space mapped all the time, but the semantics are defined in terms of per-core behaviour, not truly global.) IIRC, PCID (process context ID) stuff also assumes that IDs are per-core private, so even that wouldn't help enable sharing. Note that with Meltdown mitigation enabled, entering the kernel does change the page tables, so even common real-life use-cases aren't ideal.
So anyway, there's a huge amount of potential complexity in tagging shared TLB entries to maintain correctness according to existing ISA rules. With hyperthreading enabled, Sandybridge even statically partitions the small-page L1iTLB between logical cores, and replicates the hugepage L1iTLB (Kanter, RealWorldTech).
Also, it's not the best way to improve performance. Going off-core to a shared resource tends to be slow; e.g. an L3 data cache access takes many cycles. TLB entries can be rebuilt from the page-table data, which can itself be cached by the L3 data cache, and also by the private L2 and L1d caches; hardware page walks fetch through the data caches on PPro and later (fun fact: unlike P5 Pentium, which bypassed its on-chip caches).
Instead of going off-core (with latency presumably similar to L3 cache) to check a hypothetical shared L3TLB (which might still miss), it makes a lot more sense just to rebuild a TLB entry with local page-walk hardware. Skylake added a 2nd HW page-walker which lets it work on two TLB misses (or speculative fills) in parallel; this presumably helps more than a shared L3TLB would, even in the best-case scenario of all cores running threads of the same process with a lot of shared working-set. Processing the data from a page-table into TLB entries is probably a small part of the total cycles if the page-table data has to come from off-core.
Caching page-table data (like higher level page-directory entries) within the page-walkers helps, too, and is done in practice I think. So a page-walk might only need to fetch the bottom 2 levels for example through data caches.
TL:DR: fast page-walk hardware reading from existing private + shared data caches, and speculative TLB prefetch, solves the same problem a shared TLB might, as well as helping performance in separate-process cases. Also avoiding many problems.
Adding even more / even better page-walk hardware would do more to help more cases than a shared L3TLB.
Does DTLB stand for Data TLB? What does uTLB mean? uops-TLB? Which TLB cache level is considered here?
Yes, DTLB = Data TLB.
uTLB can't be for the uop cache; on Intel CPUs the uop cache is virtually addressed so it doesn't need a TLB. (Not sure what Ryzen's uop-cache does, but you're looking at Intel docs).
From the size and other stuff, we can see that it's not the Unified L2TLB either. (Although from Hadi's answer, it seems that UTLB might in some cases mean Unified, i.e. combined or shared data and instructions)
I found https://software.intel.com/en-us/vtune-amplifier-help-utlb-overhead, which doesn't seem to say that UTLB = first-level data TLB. Maybe it means "micro TLB" as in a small/fast TLB with only a few entries, vs. the much larger L2TLB.
Hadi found that on some Silvermont-family CPUs, "uTLB" is for 4k pages while DTLB is for other page sizes. It does seem like "micro TLB" is the right way to interpret it.
I also found the https://wikichip.org/wiki/intel/microarchitectures/kaby_lake resource regarding the TLB. It has a note: "STLB is incorrectly reported as '6-way' by CPUID leaf 2 (EAX=02H). Kaby Lake erratum KBL096 recommends software to simply ignore that value." The STLB is actually 12-way associative.
Is this a cpuid bug for all Kaby Lake CPUs?
Yes, it's a CPU bug: the CPU reports the wrong information via CPUID. That's why KBL096 is a CPU erratum, not a bug in software that uses cpuid.
If such software followed the normal rules, it would get results that don't match what KBL actually has. Intel is recommending that software special-case this and simply print the known correct result instead of what the cpuid data indicates.
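A sketch of the kind of special-casing Intel is recommending; is_kbl096_affected() is a hypothetical helper that you would implement from the family/model/stepping list in the erratum documentation.

```c
/* Sketch: override the STLB associativity reported by leaf 0x2 on parts
 * affected by KBL096. is_kbl096_affected() is a hypothetical placeholder:
 * implement it from the CPU list given in the erratum documentation. */
int is_kbl096_affected(void) { return 0; }

unsigned stlb_ways(unsigned reported_ways)
{
    if (is_kbl096_affected())
        return 12;   /* actual associativity; leaf 0x2 reports 6-way */
    return reported_ways;
}
```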