jmh indicates that M1 is faster than M2 but M1 delegates to M2

In this particular case assertMethod is indeed compiled better than requireMethod due to register allocation issues.

The benchmark looks correct, and I can consistently reproduce your results.
To analyze the problem, I've made a simplified benchmark:

package bench;

import com.google.common.collect.ImmutableMap;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class Requirements {
    private static boolean enabled = true;

    private String name = "name";
    private String value = "value";

    @Benchmark
    public Object assertMethod() {
        if (enabled)
            return requireThat(value, name);
        return null;
    }

    @Benchmark
    public Object requireMethod() {
        return requireThat(value, name);
    }

    public static Object requireThat(String parameter, String name) {
        if (name.trim().isEmpty())
            throw new IllegalArgumentException();
        return new StringRequirementsImpl(parameter, name, new Configuration());
    }

    static class Configuration {
        private Object context = ImmutableMap.of();
    }

    static class StringRequirementsImpl {
        private String parameter;
        private String name;
        private Configuration config;
        private ObjectRequirementsImpl asObject;

        StringRequirementsImpl(String parameter, String name, Configuration config) {
            this.parameter = parameter;
            this.name = name;
            this.config = config;
            this.asObject = new ObjectRequirementsImpl(parameter, name, config);
        }
    }

    static class ObjectRequirementsImpl {
        private Object parameter;
        private String name;
        private Configuration config;

        ObjectRequirementsImpl(Object parameter, String name, Configuration config) {
            this.parameter = parameter;
            this.name = name;
            this.config = config;
        }
    }
}

First of all, I've verified with -XX:+PrintInlining that the whole benchmark is inlined into one big method. This compilation unit has lots of nodes, and there are not enough CPU registers to hold all the intermediate variables. That is, the compiler needs to spill some of them to the stack.

In assertMethod 4 registers are spilled to the stack before the call to trim().
In requireMethod 7 registers are spilled later, after the call to new Configuration().

-XX:+PrintAssembly output:

  assertMethod             |  requireMethod
  -------------------------|------------------------
  mov    %r11d,0x5c(%rsp)  |  mov    %rcx,0x20(%rsp)
  mov    %r10d,0x58(%rsp)  |  mov    %r11,0x48(%rsp)
  mov    %rbp,0x50(%rsp)   |  mov    %r10,0x30(%rsp)
  mov    %rbx,0x48(%rsp)   |  mov    %rbp,0x50(%rsp)
                           |  mov    %r9d,0x58(%rsp)
                           |  mov    %edi,0x5c(%rsp)
                           |  mov    %r8,0x60(%rsp)

Apart from the if (enabled) check, this is almost the only difference between the two compiled methods. So the performance difference is explained by more variables being spilled to memory.

Why is the smaller method compiled less optimally, then? Well, the register allocation problem is known to be NP-complete. Since it cannot be solved optimally in reasonable time, compilers usually rely on heuristics. In a big method, a tiny thing like an extra if may significantly change the result of the register allocation algorithm.
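To make the sensitivity of such heuristics concrete, here is a toy sketch of greedy register assignment over an interference graph. It is only an illustration under stated assumptions: the class SpillSketch, its methods, and the four-variable chain graph are all hypothetical, and nothing here reflects how HotSpot actually allocates registers. It shows that the same variables, visited in a different order, can need a spill with two registers even though a spill-free assignment exists.

```java
import java.util.Arrays;

public class SpillSketch {
    /**
     * Greedily assigns one of `registers` registers to each variable in the
     * given visiting order and returns how many variables had to be spilled.
     * interferes[a][b] is true when variables a and b are live at the same time.
     */
    public static int countSpills(boolean[][] interferes, int[] order, int registers) {
        int[] assigned = new int[interferes.length];
        Arrays.fill(assigned, -1); // -1 means "not visited yet" or "spilled"
        int spills = 0;
        for (int v : order) {
            // Mark the registers already held by interfering variables.
            boolean[] taken = new boolean[registers];
            for (int other = 0; other < interferes.length; other++) {
                if (interferes[v][other] && assigned[other] >= 0) {
                    taken[assigned[other]] = true;
                }
            }
            // Pick the lowest free register, or spill if there is none.
            int reg = 0;
            while (reg < registers && taken[reg]) {
                reg++;
            }
            if (reg == registers) {
                spills++;
            } else {
                assigned[v] = reg;
            }
        }
        return spills;
    }

    /** Hypothetical interference graph: a chain 0-1, 1-2, 2-3. */
    public static boolean[][] chain() {
        boolean[][] g = new boolean[4][4];
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}};
        for (int[] e : edges) {
            g[e[0]][e[1]] = true;
            g[e[1]][e[0]] = true;
        }
        return g;
    }

    public static void main(String[] args) {
        // Visiting the chain in its natural order needs no spill with 2 registers.
        System.out.println(countSpills(chain(), new int[]{0, 1, 2, 3}, 2)); // prints 0
        // Visiting both ends first forces one spill with the same 2 registers.
        System.out.println(countSpills(chain(), new int[]{0, 3, 1, 2}, 2)); // prints 1
    }
}
```

The changed visiting order stands in for the kind of small perturbation an extra branch can cause in a real compilation graph: the input is essentially the same, but the heuristic ends up with a different and worse allocation.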

However, you don't need to worry about that. The effect we've seen does not mean that requireMethod is always compiled worse. In other use cases the compilation graph will be completely different due to inlining. Anyway, a 1 nanosecond difference is nothing for real application performance.

You are running your test within a single VM process by specifying forks(1). At runtime, the virtual machine observes how your code is actually executed. It then creates so-called profiles to optimize your application according to this observed behavior.

What most likely happens here is called profile pollution: running the first benchmark affects the outcome of the second benchmark. Oversimplified: if your VM was trained to do (a) very well by running its benchmark, it takes some additional time for it to get used to doing (b) afterwards. Therefore, (b) appears to take more time.
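As a rough illustration of the code shape that can lead to profile pollution, consider two measurements that funnel through one shared helper. This is a sketch under assumptions, not the asker's code: ProfilePollutionSketch, Doubler, Incrementer, benchmarkA and benchmarkB are hypothetical names, and how a particular JIT reacts depends on the VM and its settings.

```java
import java.util.function.IntUnaryOperator;

public class ProfilePollutionSketch {
    /** Hypothetical operation used only by the first "benchmark". */
    static final class Doubler implements IntUnaryOperator {
        @Override
        public int applyAsInt(int x) {
            return 2 * x;
        }
    }

    /** Hypothetical operation used only by the second "benchmark". */
    static final class Incrementer implements IntUnaryOperator {
        @Override
        public int applyAsInt(int x) {
            return x + 1;
        }
    }

    /** Shared helper: both benchmarks go through the single call site below. */
    public static int applyAll(IntUnaryOperator op, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            // The VM profiles receiver types per call site, not per caller.
            // Within one process, this site sees Doubler first and Incrementer later.
            acc += op.applyAsInt(i);
        }
        return acc;
    }

    public static int benchmarkA(int n) {
        return applyAll(new Doubler(), n);
    }

    public static int benchmarkB(int n) {
        return applyAll(new Incrementer(), n);
    }

    public static void main(String[] args) {
        System.out.println(benchmarkA(4)); // 0 + 2 + 4 + 6 = 12
        System.out.println(benchmarkB(4)); // 1 + 2 + 3 + 4 = 10
    }
}
```

If benchmarkA runs first in a given process, the profile for the call in applyAll has only ever seen Doubler. When benchmarkB runs afterwards in that same process, the call site has seen two receiver types, so whatever the VM specialized for the first case may need to be revisited. In a fresh process each benchmark would start from a clean profile.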

To avoid profile pollution, run your benchmark with multiple forks, so that the different benchmarks run in fresh VM processes. You can read more about forking in the samples that are provided by JMH.

You should also check the JMH sample on state: do not refer to your inputs as constants, but let JMH handle the values' escape through state, so that an actual computation is measured.

I would guess that, if set up properly, both benchmarks would yield similar runtimes.

Update - Here is what I get for the fixed benchmark:

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.assertMethod   avgt   40  17,592 ± 1,493  ns/op
MyBenchmark.requireMethod  avgt   40  17,999 ± 0,920  ns/op

For the sake of completeness, I also ran the benchmark with perfasm, and both methods are compiled into basically the same thing.
