Variance in RDTSC overhead

First, make sure that frequency throttling and power-saving ("green") functionality is disabled at the OS level, then restart the machine. Otherwise you may end up in a situation where the cores have unsynchronized time stamp counter values.

The 243 reading is by far the most common, which is one reason for using it as the overhead value. On the other hand, suppose you get an elapsed time <243: you subtract the overhead and get an underflow, and since the arithmetic is unsigned you end up with an enormous result. This fact speaks for using the lowest reading (234) instead. It is extremely difficult to accurately measure sequences that are only a few cycles long. On a typical x86 at a few GHz I'd recommend against timing sequences shorter than 10 ns, and even at that length they will typically be far from rock-solid.
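To illustrate, a minimal guard against that underflow (the function name is my own, not from the original code):

/* elapsed and overhead are unsigned cycle counts; clamp to zero instead
   of letting elapsed - overhead wrap around to a huge value */
unsigned __int64 adjusted_cycles (unsigned __int64 elapsed, unsigned __int64 overhead)
{
    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}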

The rest of my answer here is what I do, how I handle the results and my reasoning on the subject matter.

As to the overhead, the easiest way is to use code such as this:

unsigned __int64 rdtsc_inline (void);      /* emits rdtsc at the call site */
unsigned __int64 rdtsc_function (void);    /* executes rdtsc inside a called function */

The first form emits the rdtsc instruction into the generated code at the call site (as in your code). The second will cause the function to be called, the rdtsc instruction executed and a return instruction executed; it may also generate a stack frame. Obviously the second form is much slower than the first.
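As a sketch of what rdtsc_inline might look like in practice (assuming MSVC, whose <intrin.h> declares the __rdtsc intrinsic; GCC and Clang provide the same intrinsic in x86intrin.h):

#include <intrin.h>   /* declares the __rdtsc() intrinsic */

/* __inline lets the compiler emit rdtsc directly at the call site,
   with no call/return overhead */
static __inline unsigned __int64 rdtsc_inline (void)
{
    return __rdtsc();
}

rdtsc_function would be the same body compiled as an ordinary, non-inlined function.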

The (C) code for overhead calculation can then be written

unsigned __int64 start_cycle, end_cycle;   /* place these at the module level */
unsigned __int64 overhead;

/* place this code inside a function */
start_cycle = rdtsc_inline();
end_cycle   = rdtsc_inline();
overhead    = end_cycle - start_cycle;

If you use the inline variant you will get a low(er) overhead. You also run the risk of calculating an overhead which is greater than it "should" be (especially with the function form), which in turn means that if you measure very short/fast sequences, the previously calculated overhead may be greater than the measurement itself. When you then attempt to adjust for the overhead you will get an underflow, which leads to messy conditions. The best way to handle this is to

  1. time the overhead several times and always use the smallest value achieved (a sketch follows this list),
  2. not measure really short code sequences, since you may run into pipelining effects which require messy synchronizing instructions before the rdtsc instruction, and
  3. if you must measure very short sequences, regard the results as indications rather than as facts.
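A minimal sketch of point 1, reusing the rdtsc_inline form above (the sample count of 1000 is an arbitrary choice of mine):

unsigned __int64 calibrate_overhead (void)
{
    unsigned __int64 start, stop, diff;
    unsigned __int64 min_overhead = (unsigned __int64)-1;   /* largest possible value */
    int i;

    for (i = 0; i < 1000; i++) {
        start = rdtsc_inline();
        stop  = rdtsc_inline();
        diff  = stop - start;
        if (diff < min_overhead)
            min_overhead = diff;   /* keep the smallest reading */
    }
    return min_overhead;
}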

I've previously discussed what I do with the results in this thread.

Another thing that I do is to integrate the measurement code into the application; the overhead is insignificant. After a result has been computed I send it to a special structure where I count the number of measurements, accumulate the sums of x and x^2, and track the min and max measurements. Later on I can use the data to calculate the average and the standard deviation. The structure itself is indexed, so I can measure different performance aspects, such as individual application functions ("functional performance") and time spent in CPU, disk reading/writing, or network reading/writing ("non-functional performance").
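A sketch of such a structure (the names and the zero-initialization assumption are mine; the standard deviation uses the usual sum-of-squares identity):

#include <math.h>

typedef struct {
    unsigned __int64 count;     /* number of measurements */
    double           sum_x;     /* running sum of x       */
    double           sum_x2;    /* running sum of x^2     */
    unsigned __int64 min, max;  /* extreme measurements   */
} perf_stat;                    /* assumed zero-initialized */

void perf_record (perf_stat *p, unsigned __int64 cycles)
{
    double x = (double)cycles;
    if (p->count == 0 || cycles < p->min) p->min = cycles;
    if (p->count == 0 || cycles > p->max) p->max = cycles;
    p->count++;
    p->sum_x  += x;
    p->sum_x2 += x * x;
}

double perf_mean (const perf_stat *p)
{
    return p->sum_x / (double)p->count;
}

double perf_stddev (const perf_stat *p)
{
    double mean = perf_mean(p);
    /* population standard deviation: sqrt(E[x^2] - (E[x])^2) */
    return sqrt(p->sum_x2 / (double)p->count - mean * mean);
}

An array of these, indexed by measurement point, gives the per-aspect breakdown described above.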

If an application is instrumented in this manner and monitored from the very beginning I expect that the risk of it having performance problems during its lifetime will be greatly reduced.


The Intel programmer's manual recommends using lfence;rdtsc or rdtscp if you want to ensure that the instructions prior to the rdtsc have actually executed, because rdtsc by itself is not a serializing instruction.
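A sketch of both forms using compiler intrinsics (shown for GCC/Clang, where x86intrin.h declares _mm_lfence, __rdtsc and __rdtscp; MSVC exposes the same intrinsics via <intrin.h>; the wrapper names are mine):

#include <x86intrin.h>

/* lfence keeps rdtsc from executing before earlier instructions finish */
static inline unsigned long long tsc_after_lfence (void)
{
    _mm_lfence();
    return __rdtsc();
}

/* rdtscp also waits for prior instructions, and additionally returns a
   processor ID so a migration between cores can be detected */
static inline unsigned long long tsc_rdtscp (unsigned int *aux)
{
    return __rdtscp(aux);
}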


RDTSC can return inconsistent results for a number of reasons:

  • On some CPUs (especially certain older Opterons), the TSC isn't synchronized between cores. It sounds like you're already handling this by using sched_setaffinity -- good! (A minimal pinning sketch follows this list.)
  • If the OS timer interrupt fires while your code is running, there'll be a delay introduced while it runs. There's no practical way to avoid this; just throw out unusually high values.
  • Pipelining artifacts in the CPU can sometimes throw you off by a few cycles in either direction in tight loops. It's perfectly possible to have some loops that run in a non-integer number of clock cycles.
  • Cache! Depending on the vagaries of the CPU cache, memory operations (like the write to times[]) can vary in speed. In this case, you're fortunate that the std::vector implementation being used is just a flat array; even so, that write can throw things off. This is probably the most significant factor for this code.
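For reference, the sched_setaffinity pinning mentioned in the first bullet (Linux-specific; the wrapper name is mine):

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread to one core so every TSC read comes from the
   same counter; returns 0 on success, -1 on failure */
int pin_to_cpu (int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}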

I'm not enough of a guru on the Core2 microarchitecture to say exactly why you're getting this bimodal distribution, or how your code ran faster those 28 times, but it probably has something to do with one of the reasons above.