RDTSCP versus RDTSC + CPUID
Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?
Modern x86 CPUs can dynamically adjust their frequency to save power by underclocking (e.g. Intel's SpeedStep) and to boost performance under heavy load by overclocking (e.g. Intel's Turbo Boost). However, the time stamp counter on these modern processors counts at a constant rate (e.g. look for the "constant_tsc" flag in Linux's /proc/cpuinfo).
So the answer to your question depends on what you really want to know. Unless dynamic frequency scaling is disabled (e.g. in the BIOS), the time stamp counter can no longer be relied on to determine the number of cycles that have elapsed. However, it can still be relied on to determine the time that has elapsed (with some care - but I use clock_gettime in C - see the end of my answer).
To benchmark my matrix multiplication code and compare it to the theoretical best, I need to know both the time elapsed and the cycles elapsed (or rather, the effective frequency during the test).
Let me present three different methods to determine the number of cycles elapsed:
- Disable dynamic frequency scaling in the BIOS and use the time stamp counter.
- For Intel processors, request the core clock cycles from the performance monitoring counters.
- Measure the frequency under load.
The first method is the most reliable, but it requires access to the BIOS and affects the performance of everything else you run (when I disable dynamic frequency scaling on my i5-4250U it runs at a constant 1.3 GHz instead of a base of 2.6 GHz). It's also inconvenient to change the BIOS just for benchmarking.
The second method is useful when you don't want to disable dynamic frequency scaling and/or for systems you don't have physical access to. However, the performance monitoring counters require privileged instructions which only the kernel or device drivers have access to.
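On Linux, one way to get at these counters without writing a driver is the perf_event_open(2) system call, through which the kernel exposes them to user space. Here is a minimal sketch of counting core clock cycles that way (an illustration of the Linux interface, not necessarily the setup I use; foo() stands in for the code under test):
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

extern void foo(void);  /* the code under test */

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* core clock cycles, not the TSC */
    attr.disabled = 1;                       /* start disabled, enable explicitly */
    attr.exclude_kernel = 1;                 /* count user-space cycles only */

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    foo();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    read(fd, &cycles, sizeof(cycles));
    printf("core clock cycles: %llu\n", (unsigned long long)cycles);
    return 0;
}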
The third method is useful on systems where you don't have physical access and do not have privileged access. This is the method I use most in practice. It's in principle the least reliable but in practice it's been as reliable as the second method.
Here is how I determine the time elapsed (in seconds) with C.
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

/* Return the elapsed time between two timespecs in seconds. */
double time_diff(struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {  /* borrow a second */
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return (double)temp.tv_sec + (double)temp.tv_nsec * 1E-9;
}

struct timespec time1, time2;
clock_gettime(TIMER_TYPE, &time1);
foo();
clock_gettime(TIMER_TYPE, &time2);
double dtime = time_diff(time1, time2);
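And for completeness, here is a minimal sketch of how the third method can be carried out (an illustration rather than my exact benchmarking code, reusing TIMER_TYPE and time_diff from above): time a long chain of dependent single-cycle additions and divide the iteration count by the elapsed time. The empty GCC/Clang inline asm statement keeps the compiler from folding the loop away.
#include <stdio.h>
#include <time.h>

int main(void)
{
    const unsigned long long N = 1000000000ULL;
    struct timespec time1, time2;
    unsigned long long x = 0;

    clock_gettime(TIMER_TYPE, &time1);
    for (unsigned long long i = 0; i < N; i++) {
        x += 1;                          /* dependent add: ~1 cycle per iteration */
        __asm__ volatile("" : "+r"(x));  /* defeat constant folding */
    }
    clock_gettime(TIMER_TYPE, &time2);

    printf("effective frequency ~ %.2f GHz\n", N / time_diff(time1, time2) * 1e-9);
    return 0;
}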
A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc, you need to use cpuid to ensure that no additional instructions are in the execution pipeline. The rdtscp instruction flushes the pipeline intrinsically. (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well).
You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.
Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor. If this is your pattern:
uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();
atomic_add(e - s, &acc);
atomic_add(1, &counter);
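Here rdtscp() stands for a thin wrapper around the instruction; a minimal sketch using the GCC/Clang intrinsic from x86intrin.h:
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t rdtscp(void)
{
    unsigned int aux;       /* receives IA32_TSC_AUX (processor id); unused here */
    return __rdtscp(&aux);  /* compiles to the RDTSCP instruction */
}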
You may still have an off-by-one in your average measurement depending on where your read happens. For instance:
      T1                          T2
t0    atomic_add(e - s, &acc);
t1                                a = atomic_read(&acc);
t2                                c = atomic_read(&counter);
t3    atomic_add(1, &counter);
t4                                avg = a / c;
It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
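As an illustration, a minimal sketch of keeping a moving average in-line with the delta (an integer exponentially weighted moving average; the 1/16 weight is an arbitrary illustrative choice):
#include <stdint.h>

static uint64_t avg_cycles;  /* updated only by the measuring thread */

static inline void record_sample(uint64_t delta)
{
    /* exponentially weighted moving average: avg += (sample - avg) / 16 */
    avg_cycles += ((int64_t)delta - (int64_t)avg_cycles) / 16;
}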
Side-points:
- If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
- Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like:
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
s = rdtscp();
loop_body();
e = rdtscp();
acc += e - s;
}
printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine. Doing it the way shown above is also bad because you end up with two pipeline stalls per loop iteration. Thus:
s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));
This will be more efficient, and probably more accurate in terms of what you'll see in real life, than what the previous benchmark would tell you.
The 2010 Intel paper How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures should be considered outdated when it comes to its recommendation to combine RDTSC/RDTSCP with CPUID.
Current Intel reference documentation recommends fencing instructions as more efficient alternatives to CPUID:
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.
(Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 8.2.5, September 2016)
If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
(Intel RDTSC)
Thus, to get the TSC start value you execute this instruction sequence:
mfence         ; make all previous stores globally visible
lfence         ; wait until all previous instructions have executed
rdtsc          ; read the TSC into EDX:EAX
shl rdx, 0x20  ; combine EDX:EAX ...
or rax, rdx    ; ... into a single 64-bit value in RAX
At the end of your benchmark, to get the TSC stop value:
rdtscp         ; waits for all previous instructions before reading the TSC
lfence         ; keeps subsequent instructions from starting before rdtscp
shl rdx, 0x20  ; combine EDX:EAX ...
or rax, rdx    ; ... into a single 64-bit value in RAX
Note that in contrast to CPUID, the lfence instruction doesn't clobber any registers, thus it isn't necessary to save the EDX:EAX registers before executing the serializing instruction.
Relevant documentation snippet:
If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP (Intel RDTSCP)
As an example of how to integrate this into a C program, see also my GCC inline assembler implementations of the above operations.
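In case that link is unavailable, here is a minimal sketch of what such GCC extended inline assembly can look like (the function names are illustrative; assumes x86-64 and GCC/Clang):
#include <stdint.h>

static inline uint64_t tsc_start(void)
{
    uint64_t lo, hi;
    __asm__ volatile("mfence\n\t"
                     "lfence\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi) : : "memory");
    return (hi << 32) | lo;
}

static inline uint64_t tsc_stop(void)
{
    uint64_t lo, hi;
    __asm__ volatile("rdtscp\n\t"
                     "lfence"
                     : "=a" (lo), "=d" (hi)
                     : : "rcx", "memory");  /* rdtscp also writes IA32_TSC_AUX to ECX */
    return (hi << 32) | lo;
}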
The following code will ensure that rdtscp executes at exactly the right time. RDTSCP cannot execute too early, but it can execute too late, because the CPU can move instructions that come after rdtscp so that they execute before it. In order to prevent this we create a false dependency chain based on the fact that rdtscp puts its output in edx:eax:
rdtscp           ;rdtscp is read-serialized, it will not execute too early
                 ;the code below also ensures it does not execute too late
mov r8,rdx       ;rdtscp changes rdx and rax, force dependency chain on rdx
xor r8,rbx       ;mix rbx into r8: a 'push rbx' that cannot execute OoO
xor rbx,rdx      ;rbx = rbx ^ rdx = r8
xor rbx,r8       ;rbx = 0, but via the dependency chain
push rdx         ;save the TSC high half
push rax         ;save the TSC low half
mov rax,rbx      ;rax = 0 (cpuid leaf 0), in a way that excludes OoO execution
cpuid            ;serialize: later instructions cannot start before this point
pop rax          ;restore the TSC low half
pop rdx          ;restore the TSC high half
mov rbx,r8       ;rbx = rdx ^ original rbx
xor rbx,rdx      ;restore rbx
Note that even though this timing is accurate to a single cycle, you still need to run your sample many, many times and take the lowest time of those runs in order to get the actual running time.
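A minimal sketch of that measurement discipline (tsc_start()/tsc_stop() are illustrative wrappers around sequences like the ones above, and work_under_test() stands in for your code):
#include <stdint.h>

extern uint64_t tsc_start(void);     /* illustrative: e.g. mfence; lfence; rdtsc */
extern uint64_t tsc_stop(void);      /* illustrative: e.g. rdtscp (+ the trick above) */
extern void work_under_test(void);   /* the code being measured */

uint64_t measure_min(int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t s = tsc_start();
        work_under_test();
        uint64_t e = tsc_stop();
        if (e - s < best)
            best = e - s;            /* keep the lowest time over all runs */
    }
    return best;
}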