unexplained 10%+ performance boost from simply adding a method argument (slimmer jit code)
Reproduction and Analysis
I was able to reproduce your results. Machine data:
Linux #143-Ubuntu x86_64 GNU/Linux
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
I rewrote your code a bit and I have done some additional testing. Your test time includes the System.arraycopy()
call. I created a 400Mb array structure and saved it:
int[][][] data = new int[iterations][testCases][];
for (int iteration = 0; iteration < data.length; iteration++) {
for (int testcase = 0; testcase < data[iteration].length; testcase++) {
data[iteration][testcase] = random.ints(numberCount, 0, bound).toArray();
}
}
FileOutputStream fos = new FileOutputStream("test_array.dat");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(data);
After that I have run this tests (warmup, teardown run as well):
{
FileInputStream fis = new FileInputStream(fileName);
ObjectInputStream iis = new ObjectInputStream(fis);
int[][][] data = (int[][][]) iis.readObject();
perf("qs2", () -> {
for (int iteration = 0; iteration < data.length; iteration++) {
for (int testCase = 0; testCase < data[iteration].length; testCase++) {
quicksort2(data[iteration][testCase], 0, data[iteration][testCase].length - 1);
}
}
return null;
});
}
{
FileInputStream fis = new FileInputStream(fileName);
ObjectInputStream iis = new ObjectInputStream(fis);
int[][][] data = (int[][][]) iis.readObject();
perf("qs1", () -> {
for (int iteration = 0; iteration < data.length; iteration++) {
for (int testCase = 0; testCase < data[iteration].length; testCase++) {
quicksort1(data[iteration][testCase], 0, data[iteration][testCase].length - 1, und);
}
}
return null;
});
}
In case I run the qs1 and qs2 together:
main]: qs1: 6646.219874 ms (res=null)
main]: qs2: 7418.376646 ms (res=null)
The result is not execution order dependent:
main]: qs2: 7526.215395 ms (res=null)
main]: qs1: 6624.261529 ms (res=null)
I have run the code in new JVM instances as well:
Instance one:
main]: qs1: 6592.699738 ms (res=null)
Instance two:
main]: qs2: 7456.326028 ms (res=null)
If you try it without the JIT:
-Djava.compiler=NONE
The results are as "expected" (the smaller bytecode is faster):
main]: qs1: 56547.589942 ms (res=null)
main]: qs2: 53585.909246 ms (res=null)
For better analysis I extracted the codes to two different classes.
I was using jclasslib for bytecode inspection. The method lengths for me:
Q1: 505
Q2: 480
This makes sense for the execution without the JIT:
53585.909246×505÷480 = 56376.842019229
Which is really close to 56547.589942.
Reason
For me in the compilation output (using -XX:+PrintCompilation
) I have these lines
1940 257 2 QS1::sort (185 bytes)
1953 258 % 4 QS1::sort @ 73 (185 bytes)
1980 259 4 QS1::sort (185 bytes)
1991 257 2 QS1::sort (185 bytes) made not entrant
9640 271 3 QS2::sort (178 bytes)
9641 272 4 QS2::sort (178 bytes)
9654 271 3 QS2::sort (178 bytes) made not entrant
Where the % means on stack replacement (where the compiled code is running). According to this log the call with the extra String parameter gets optimized and the second is not. I was thinking on better branch prediction, but this should not be the case here (tried to add randomli generated Strings as parameters). The sample sizes (400Mb) mostly rule out caching. I thought on optimization treshold, but when I use these options -Xcomp -XX:+PrintCompilation -Xbatch
the output is the following:
6408 3254 b 3 QS1::sort (185 bytes)
6409 3255 b 4 QS1::sort (185 bytes)
6413 3254 3 QS1::sort (185 bytes) made not entrant
14580 3269 b 3 QS2::sort (178 bytes)
14580 3270 b 4 QS2::sort (178 bytes)
14584 3269 3 QS2::sort (178 bytes) made not entrant
This means that the metods are fored blocking compiled before called but the times remain the same:
main]: qs1: 6982.721328 ms (res=null)
main]: qs2: 7606.077812 ms (res=null)
The key to this I think is the String
. In case I change the extra (unused) parameter to int
it is consistently processed slightly slower (running with the previous optimization parameters):
main]: qs1: 7925.472909 ms (res=null)
main]: qs2: 7727.628422 ms (res=null)
My conclusion is that the optimization may be influenced by the extra parameters object type. Probably there is less eager optimization in case of primitives which makes sense to me, but I could not find exact source for that claim.
An additional interesting read.