Why does MATLAB's element-wise exponentiation speed up for 512 elements?

I see the same effect using random numbers for the exponent as I see using integers in the range 1:n:

x = 500:540;
t = zeros(size(x));
for ii = 1:numel(x)
    %m = 1:x(ii);
    m = 500*rand(1,x(ii));
    t(ii) = timeit(@()power(2,m));
end
plot(x,t)

graph showing jump down in execution time around 512

When forcing MATLAB to use a single thread with maxNumCompThreads(1), and running the code above again, I see this graph instead (note the y-axis, the peaks are just noise):

graph not showing the jump at 512

It looks to me like MATLAB uses a single core to compute the exponential of 511 values or fewer, and fires up all cores if the array is larger. There is an overhead to multithreading, so it is not worthwhile for small arrays. The exact point where the overhead is balanced by the time savings depends on many factors, so hard-coding a fixed threshold for switching to multithreaded computation leads to a jump in execution time on systems with characteristics different from those of the system where the threshold was determined.
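A quick way to test this hypothesis is to time the same array both ways (a sketch, using the documented maxNumCompThreads function, which returns the previous setting so it can be restored):

```matlab
% Sketch: compare single-threaded vs default timings just above the threshold.
nThreadsOld = maxNumCompThreads(1);  % force a single thread, remember old value
m = 500 * rand(1, 520);              % array size above the 512 threshold
tSingle = timeit(@() power(2, m));
maxNumCompThreads(nThreadsOld);      % restore the default thread count
tMulti = timeit(@() power(2, m));
fprintf('single: %g s, default: %g s\n', tSingle, tMulti);
```

If the jump at 512 is a threading threshold, the single-threaded timing should sit on the extrapolation of the small-array curve.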

Note that @norok2 is not seeing this same jump because on their system MATLAB was limited to a single thread.


This is related to the size of the numbers for which the power is computed, rather than to the size of the container.

If you use random numbers with varying container sizes, you do not observe a jump in the timings:

x = 450:1550;
y = zeros(numel(x), 1);
X = rand(1, 10000);
for i = 1:length(x)
    f = @() 2 .^ X(1:x(i));
    y(i) = timeit(f);
end

figure()
plot(x, y)

power_test_1

Therefore the issue must be with the computation for very large numbers. At first I thought this might be related to overflow, but overflow happens at 2 ^ 1024 == inf, as dictated by the IEEE 754 standard which MATLAB follows, and I would have expected computing inf to be much faster than computing an actual number.
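For reference, the overflow boundary is easy to check at the prompt (these values follow directly from IEEE 754 double precision):

```matlab
2 ^ 1023         % largest finite power of two, approx. 8.9885e+307
2 ^ 1024         % overflows to Inf
realmax          % largest finite double: (2 - 2^-52) * 2^1023
isinf(2 ^ 1024)  % logical 1
```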

This is supported by the following benchmark where the size of the array is kept constant:

x = 450:1550;
y = zeros(numel(x), 1);
X = rand(1, 10000);
for i = 1:length(x)
    f = @() 2 .^ (ones(1, 500) * x(i));
    y(i) = timeit(f);
end

figure()
plot(x, y)

power_test_2

Why exactly this becomes relevant for your setup at 2 ^ 512 instead of 2 ^ 1024, I do not really understand.

(Note that I used 2 .^ ... instead of power(2, ...) but the results are the same.)
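The equivalence of the two spellings can be verified directly, since the .^ operator is just syntax for the power function:

```matlab
m = 500 * rand(1, 100);
isequal(2 .^ m, power(2, m))  % logical 1: the two forms give identical results
```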


Also, running @CrisLuengo's code on my system does not reproduce the jump.

x = 500:540;
t = zeros(size(x));
for ii = 1:numel(x)
    %m = 1:x(ii);
    m = 500*rand(1,x(ii));
    t(ii) = timeit(@()power(2,m));
end
plot(x,t)

All the evidence so far indicates that the spike is related to JIT latency/warm-up.

graph showing a spike (attributed to JIT warm-up) but no jump at 512


Here's some confirmation of what Cris found, using a 4-core Windows machine running MATLAB R2018a. I first tested the following code to show that the specific value of the exponent wasn't the culprit for the jump:

t = zeros(4, 1000);
for p = 1:size(t, 1)
  for n = 1:size(t, 2)
    t(p, n) = timeit(@() power(2, (2.^(p-1)).*ones(1, n)));
  end
end

And here are the results:

graph of timings for exponents 1, 2, 4, and 8 versus array size

For degenerate edge cases where the exponent is 1 (return the same value) or 2 (return the value times itself) the computation runs faster, as expected. However, the jump at an array size of 512 shows that these edge cases pick up overhead there, whereas for exponents of 4 and 8 the computation time drops once the array size exceeds 512. Larger exponent values simply reproduce the upper curves.

I then ran two more tests: one with array size between 1 and 511, and a second with array size between 512 and 1024. Here's what the processor load looked like:

processor load graphs for the two tests

Processor 3 shows a large load spike during the first test, while all 4 processors show load spikes during the second. This confirms that multithreading is employed for array sizes of 512 or above. It also explains why the simple edge cases get slower at larger sizes: the overhead of multithreading outweighs the speedup gained by splitting up such cheap calculations.