What is microbenchmarking?

There is no formal definition of micro-benchmarking, but when I use the term I mean a small, artificial benchmark designed to test the performance of some specific hardware¹ or language feature. By contrast, a better benchmark is a real program designed to perform a real task. (Drawing a hard line between the two cases is pointless, IMO, and I won't try.)

The danger of micro-benchmarking is that it is easy to write a benchmark that gives totally misleading results. Some common traps in Java micro-benchmarks are:

  • writing code that the compiler can deduce does no useful work, and therefore optimize away entirely,
  • not taking account of the "lumpy" nature of Java memory management, and
  • not taking account of JVM startup effects; e.g. the time taken to load and JIT-compile classes, and (conversely) the execution speedup that happens once the methods have been JIT-compiled.

However, even once you have addressed the issues above, there is a systemic issue with benchmarking that is impossible to address. The code and behavior of a benchmark usually bears little relation to what you really care about; i.e. how your application is going to perform. There are far too many "hidden variables" for you to be able to generalize from a benchmark to typical programs, let alone to your program.

For these reasons, we regularly advise people NOT to waste their time with micro-benchmarks. Instead, it is best to write simple and natural code, and use a profiler to identify areas that need to be hand optimized. Interestingly, it usually turns out that the most significant performance problems in real applications are due to bad design of data structures and algorithms (including networking, database and threading-related bottlenecks) rather than the kind of things that typical micro-benchmarks are trying to test.

@BalusC has provided an excellent link to material on this topic in the Hotspot FAQ page. And here is a link to an IBM whitepaper by Brian Goetz.


¹ - Experts would not even try to do hardware benchmarking in Java. There are too many "complex things" happening between the bytecodes and the hardware to draw valid / useful conclusions about hardware from the raw results. You would be better off using a language that is closer to the hardware; e.g. C or even assembly code.


It means exactly what it says on the tin - it's measuring the performance of something "small", like a system call to the kernel of an operating system.

The danger is that people may use whatever results they obtain from microbenchmarking to dictate optimizations. And as we all know:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

There can be many factors that skew the results of microbenchmarks. Compiler optimization is one of them. If the operation being measured takes so little time that whatever you use to measure it takes longer than the operation itself, your microbenchmark will be skewed as well.
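To get a feel for that second problem, here is a quick sketch (assuming a POSIX system with clock_gettime(); the iteration count is an arbitrary choice of mine, not part of the original answer) that times the timer itself by calling it back to back:

#include <stdio.h>
#include <time.h>

// Sketch: estimate the cost of the clock itself by calling it in a
// loop. If the operation you want to benchmark takes a comparable
// amount of time, your measurements are dominated by the timer.
int main(void)
{
    const long iterations = 1000000;
    struct timespec t0, t1, tmp;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for(long i = 0; i < iterations; ++i)
    {
        clock_gettime(CLOCK_MONOTONIC, &tmp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                      + (t1.tv_nsec - t0.tv_nsec);
    printf("clock_gettime overhead: ~%.1f ns per call\n",
           (double)elapsed / iterations);
    return 0;
}

On a typical machine this often comes out to a few tens of nanoseconds per call, so single measurements of anything in that range are mostly noise.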

To see the compiler-optimization problem, someone might take a microbenchmark of the overhead of for loops:

#include <stdio.h>
#include <time.h>

// Returns the current time in nanoseconds (assumes a POSIX clock_gettime()).
long long GetTime()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void TestForLoop()
{
    long long start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
    }

    long long elapsed = GetTime() - start;
    double elapsedPerIteration = (double)elapsed / 1000000000;
    printf("Time elapsed for each iteration: %f ns\n", elapsedPerIteration);
}

Obviously, compilers can see that the loop does absolutely nothing and will not generate any code for the loop at all. So the values of elapsed and elapsedPerIteration are pretty much useless.

Even if the loop does something:

void TestForLoop()
{
    int sum = 0;
    long long start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    long long elapsed = GetTime() - start;
    double elapsedPerIteration = (double)elapsed / 1000000000;
    printf("Time elapsed for each iteration: %f ns\n", elapsedPerIteration);
}

The compiler may see that the variable sum isn't going to be used for anything and optimize it away, and optimize away the for loop as well. But wait! What if we do this:

void TestForLoop()
{
    int sum = 0;
    long long start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        ++sum;
    }

    long long elapsed = GetTime() - start;
    double elapsedPerIteration = (double)elapsed / 1000000000;
    printf("Time elapsed for each iteration: %f ns\n", elapsedPerIteration);
    printf("Sum: %d\n", sum); // Added
}

The compiler might be smart enough to realize that sum will always be a constant value, and optimize all that away as well. Many would be surprised at the optimizing capabilities of compilers these days.
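If the goal really is to time the loop body, one common workaround (a sketch, not something from this answer) is to make the trip count unknown at compile time and mark the accumulator volatile, so the compiler can neither fold nor delete the work. Reusing the GetTime() helper from the first example:

void TestForLoop(long iterations)
{
    // volatile forces every increment to go through memory, so the
    // compiler cannot replace the loop with "sum = iterations".
    volatile long sum = 0;
    long long start = GetTime();

    for(long i = 0; i < iterations; ++i)
    {
        ++sum;
    }

    long long elapsed = GetTime() - start;
    printf("Time elapsed for each iteration: %f ns\n",
           (double)elapsed / iterations);
    printf("Sum: %ld\n", sum);
}

Of course, volatile only trades one problem for another: the loop is no longer optimized away, but you are now measuring a memory round trip per iteration rather than a plain increment, which is exactly the kind of distortion this answer is warning about.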

But what about things that compilers can't optimize away?

void TestFileOpenPerformance()
{
    FILE* file = NULL;
    long long start = GetTime();

    for(int i = 0; i < 1000000000; ++i)
    {
        file = fopen("testfile.dat", "r");
        fclose(file);
    }

    long long elapsed = GetTime() - start;
    double elapsedPerIteration = (double)elapsed / 1000000000;
    printf("Time elapsed for each file open: %f ns\n", elapsedPerIteration);
}

Even this is not a useful test! After the first iteration, the operating system has cached the file's data and metadata in memory, so every subsequent open is served from memory rather than from disk. Pretty much all operating systems do this. Something similar happens when you launch applications - the operating system may notice which applications you open most often and preload their code into memory when you boot up the computer!

In fact, there are countless variables that come into play: locality of reference (e.g. arrays vs. linked lists), effects of caches and memory bandwidth, compiler inlining, compiler implementation, compiler switches, number of processor cores, optimizations at the processor level, operating system schedulers, operating system background processes, etc.
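The first of those, locality of reference, is easy to demonstrate with a hypothetical sketch (the array size and the shuffle are arbitrary choices of mine, not from this answer): the same additions run much slower once the access pattern stops matching the memory layout.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

// Sketch of the "locality of reference" factor: the same ten million
// additions, first over a contiguous array, then through a shuffled
// index array that imitates the scattered accesses of a linked list.
// The arithmetic is identical; only the memory access pattern changes.
int main(void)
{
    int* data = malloc(N * sizeof(int));
    int* order = malloc(N * sizeof(int));
    for(int i = 0; i < N; ++i) { data[i] = 1; order[i] = i; }

    // Shuffle the access order so it no longer matches memory layout.
    srand(42);
    for(int i = N - 1; i > 0; --i) {
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }

    long sum = 0;
    clock_t start = clock();
    for(int i = 0; i < N; ++i) sum += data[i];         // sequential
    printf("sequential: %.1f ms (sum=%ld)\n",
           (clock() - start) * 1000.0 / CLOCKS_PER_SEC, sum);

    sum = 0;
    start = clock();
    for(int i = 0; i < N; ++i) sum += data[order[i]];  // scattered
    printf("scattered:  %.1f ms (sum=%ld)\n",
           (clock() - start) * 1000.0 / CLOCKS_PER_SEC, sum);

    free(data);
    free(order);
    return 0;
}

Both loops perform identical arithmetic, yet the scattered version is typically several times slower once the array no longer fits in cache, so the "cost of an addition" you measure depends on details that have nothing to do with the addition.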

So microbenchmarking isn't a useful metric in a lot of cases. It certainly does not replace whole-program benchmarks with well-defined test cases, or profiling. Write readable code first, then profile to see what needs to be optimized, if anything.

I would like to emphasize that microbenchmarks are not evil per se, but one has to use them carefully (that is true for lots of other things related to computers, too).


  • What DOES it mean and what DOESN'T it mean?

I would say micro-benchmarking simply means measuring something tiny. "Tiny" is probably context-dependent, but typically on the level of a single system call or something similar. Benchmarking refers to everything above that.

  • What are some examples of what IS and ISN'T microbenchmarking?

This (archived) article lists measuring the time of a getpid() system call and measuring the time to copy memory using memcpy() as examples of micro-benchmarking.
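As an illustration, a minimal sketch of the getpid() case might look like this (the iteration count and the use of clock_gettime() are my own choices, not the article's):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

// Call getpid() in a loop and report the average cost per call.
int main(void)
{
    const long iterations = 10000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for(long i = 0; i < iterations; ++i)
    {
        getpid();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                      + (t1.tv_nsec - t0.tv_nsec);
    printf("getpid: %.1f ns per call\n", (double)elapsed / iterations);
    return 0;
}

Even this tiny example has traps of its own: some C library versions have cached getpid() in user space, in which case you are timing a cached value rather than a real system call.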

Measuring a complete algorithm implementation, on the other hand, would not count as micro-benchmarking. In particular, result reports that list tasks in order of decreasing execution time probably seldom count as micro-benchmarking.

  • What are the dangers of microbenchmarking and how do you avoid it?

The obvious danger is that it tempts developers to optimize the wrong parts of a program. Another danger is that it is notoriously difficult to measure something small accurately. The easiest way to avoid the first danger is probably just to get a good picture of where most of the time is spent in the program.

People usually say "don't do micro-benchmarking" but what they probably mean is "don't make optimization decisions based on micro-benchmarks".

  • (or is it a good thing?)

It's not a bad thing per se, contrary to what others here, and many web pages, seem to suggest. It has its place. I work with program rewriting, runtime aspect weaving, etc. We usually publish micro-benchmarks of our added instructions, not to guide any optimizations, but to make sure that our extra code has close to no impact on the execution of the rewritten program.

It's an art, however, especially in the context of a VM with JIT compilation, warm-up times, etc. A good approach for Java is described here (archived).