rdpmc: surprising behavior
The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.
The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.
If you want to use raw RDPMC yourself, you need to either program and enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf, e.g. perf stat ./a.out
If you use perf stat -e instructions:u ./perf ; echo $?, the fixed counter will actually be zeroed before entering your code, so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u), you don't know the initial value of the counter. You can fix that by taking a delta: read the counter once at the start, then once after your loop.
The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result, because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
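The wraparound argument above can be checked with plain shell arithmetic. This is only a sketch: delta32 and exit_status are hypothetical helper names, and the two readings are made-up values chosen so that the low 32 bits wrap between the reads.

```shell
#!/bin/sh
# delta32 START END: difference of two counter readings modulo 2^32,
# i.e. what `sub eax, edi` leaves in a 32-bit register.
delta32() {
    echo $(( ($2 - $1) & 0xFFFFFFFF ))
}

# exit_status N: the low 8 bits of N, which is all that survives
# as a process exit status.
exit_status() {
    echo $(( $1 & 0xFF ))
}

# Hypothetical readings (assumed values, not measured):
# the low 32 bits wrapped around between the two reads.
start=0xFFFFFFF0
end=0x00000007

delta32 "$start" "$end"                     # prints 23
exit_status "$(delta32 "$start" "$end")"    # prints 23
exit_status 279                             # 279 = 256 + 23, prints 23
```

The last line shows why the exit-status trick only works for small counts: any count of 256 or more aliases to a smaller value.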
Simplified even more than in your question. Also note indenting the operands so they can stay at a consistent column even for mnemonics longer than 3 letters.
segment .text
global _start
_start:
    mov     ecx, 1<<30          ; fixed counter: instructions
    rdpmc
    mov     edi, eax            ; start

    mov     edx, 10
.loop:
    dec     edx
    jnz     .loop

    rdpmc                       ; ecx = same counter as before
    sub     eax, edi            ; end - start

    mov     edi, eax
    mov     eax, 231
    syscall                     ; sys_exit_group(rdpmc).  sys_exit isn't wrong, but glibc uses exit_group.
Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall).
23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.
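The counts quoted above can be re-derived by tallying the listing by hand. The sketch below does that tally in shell; the function names are hypothetical and the per-section instruction counts are read off the assembly listing, not measured.

```shell
#!/bin/sh
# counted_instructions ITERS: instructions retired strictly after the
# first rdpmc, up to and including the second rdpmc.
counted_instructions() {
    setup=2       # mov edi, eax ; mov edx, 10
    per_iter=2    # dec edx ; jnz .loop
    closing=1     # the second rdpmc itself
    echo $(( $setup + $per_iter * $1 + $closing ))
}

# total_instructions ITERS: every instruction the program executes,
# including the final syscall.
total_instructions() {
    before_loop=4   # mov ecx ; rdpmc ; mov edi ; mov edx
    after_loop=5    # rdpmc ; sub ; mov edi ; mov eax ; syscall
    echo $(( $before_loop + 2 * $1 + $after_loop ))
}

counted_instructions 10   # prints 23
total_instructions 10     # prints 29
```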
If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. The rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.
But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values, as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) to give ECX a false dependency on the results of the instructions you care about. For example, the pair and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.
Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.
PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):
echo 2 | sudo tee /sys/devices/cpu/rdpmc # enable RDPMC always, not just when a perf event is open
The first step is to ensure that the performance counters you want to use are enabled in the IA32_PERF_GLOBAL_CTRL MSR, whose layout is shown in Figure 18-8 of the Intel Manual Volume 3 (January 2019). You can easily do this by loading the MSR kernel module (sudo modprobe msr) and executing the following command:
sudo rdmsr -a 0x38F
The value 0x38F is the address of the IA32_PERF_GLOBAL_CTRL MSR, and the -a option specifies that the rdmsr instruction should be executed on all logical cores. By default, this should print 7000000ff (when HT is disabled) or 70000000f (when HT is enabled) for all logical cores. For the INST_RETIRED.ANY fixed-function performance counter, the bit at index 32 is the one that enables it, so it should be 1. The value 7000000ff indicates that all three fixed-function counters and all eight programmable counters are enabled.
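To make the bit layout concrete, here is a small sketch that decodes a value read from IA32_PERF_GLOBAL_CTRL. It assumes the layout described above (programmable counters in bits 0-7, fixed-function counters in bits 32-34); bit_set and describe_global_ctrl are hypothetical helper names, and the script does pure arithmetic without touching any MSR.

```shell
#!/bin/sh
# bit_set VALUE INDEX: print 1 if bit INDEX of VALUE is set, else 0.
bit_set() {
    echo $(( ($1 >> $2) & 1 ))
}

# describe_global_ctrl VALUE: count the enabled programmable counters
# (bits 0-7) and fixed-function counters (bits 32-34).
describe_global_ctrl() {
    val=$1
    prog=0
    fixed=0
    i=0
    while [ "$i" -lt 8 ]; do
        prog=$(( $prog + (($val >> $i) & 1) ))
        i=$(( $i + 1 ))
    done
    i=32
    while [ "$i" -lt 35 ]; do
        fixed=$(( $fixed + (($val >> $i) & 1) ))
        i=$(( $i + 1 ))
    done
    echo "programmable=$prog fixed=$fixed"
}

bit_set 0x7000000ff 32              # prints 1 (INST_RETIRED.ANY enabled)
describe_global_ctrl 0x7000000ff    # prints programmable=8 fixed=3
describe_global_ctrl 0x70000000f    # prints programmable=4 fixed=3
```

Since rdmsr prints the value without a 0x prefix, you would need to prepend one before passing its output to these helpers.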
The IA32_PERF_GLOBAL_CTRL register has one enable bit for each performance counter per logical core. Each programmable performance counter also has its own dedicated control register, and there is a single control register for all of the fixed-function counters. In particular, the control register for the INST_RETIRED.ANY fixed-function performance counter is IA32_FIXED_CTR_CTRL, whose layout is shown in Figure 18-7 of the Intel Manual Volume 3. There are 12 defined bits in the register; the first 4 bits control the behavior of the first fixed-function counter, i.e., INST_RETIRED.ANY (the order is shown in Table 19-2). Before modifying the register, you should first check how it got initialized by the OS by executing:
sudo rdmsr -a 0x38D
It should print 0xb0 by default. This indicates that the second fixed-function counter (unhalted core cycles) is enabled and configured to count in both supervisor mode and user mode. To enable INST_RETIRED.ANY and configure it to count only user mode events while keeping the unhalted core cycles counter as is, execute the following command:
sudo wrmsr -a 0x38D 0xb2
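The 0xb2 value can be derived rather than memorized. The sketch below assumes the per-counter 4-bit field layout of IA32_FIXED_CTR_CTRL (bit 0 enables counting in ring 0, bit 1 enables counting in user mode, bit 2 is AnyThread, bit 3 enables a PMI on overflow); the helper names and the EN_OS / EN_USR constants are hypothetical, and nothing here writes to an MSR.

```shell
#!/bin/sh
EN_OS=1     # bit 0 of a field: count in ring 0
EN_USR=2    # bit 1 of a field: count in user mode

# get_fixed_field VALUE INDEX: print the 4-bit control field of
# fixed-function counter INDEX, in hex.
get_fixed_field() {
    printf '0x%x\n' $(( ($1 >> (4 * $2)) & 0xF ))
}

# set_fixed_field OLD INDEX FIELD: replace counter INDEX's field in OLD,
# leave every other field untouched, and print the new value in hex.
set_fixed_field() {
    shift_by=$(( 4 * $2 ))
    printf '0x%x\n' $(( ($1 & ~(0xF << $shift_by)) | (($3 & 0xF) << $shift_by) ))
}

get_fixed_field 0xb0 1                         # prints 0xb (cycles: OS + user + PMI)
set_fixed_field 0xb0 0 "$EN_USR"               # prints 0xb2 (user-only instructions)
set_fixed_field 0xb0 0 $(( EN_OS | EN_USR ))   # prints 0xb3 (user + kernel)
```

Doing the update as a read-modify-write like this avoids clobbering the cycles counter's settings if the default on your machine is not 0xb0.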
Once the wrmsr command above is executed, events are counted immediately. You can check this by reading the first fixed-function counter IA32_PERF_FIXED_CTR0 (see Table 19-2):
sudo rdmsr -a 0x309
You can execute that command multiple times and see how the counts on each core are changing. Unfortunately, this means that by the time your program is run, the current value in IA32_PERF_FIXED_CTR0 will be essentially arbitrary. You can try to reset the counter by executing:
sudo wrmsr -a 0x309 0
But the fundamental problem remains: you cannot instantaneously reset the counter and run your program. As suggested in @Peter's answer, the right way to use any performance counter is to wrap the region of interest between rdpmc instructions and take the difference.
The MSR kernel module is very convenient because the only way to access MSR registers is in kernel mode. However, there is an alternative to wrapping the code between rdpmc instructions. You can write your own kernel module and place your code in the kernel module immediately after the instruction that enables the counter. You can even disable interrupts. Typically, this level of accuracy is not worth the effort.
You can use the -p option instead of -a to specify a particular logical core. However, you'll then have to make sure that the program runs on that same core, e.g. taskset -c 3 ./a.out to run on core #3.