Why does numpy not short-circuit on non-contiguous arrays?
I got interested in solving this problem. So I`ve come with the next solution that manages to avoid the "a[::-1]
" problem case due to internal ndarray copies by np.argmax
:
I created a small library that implements a function argmax
which is a wrapper of np.argmax
, but it has increased performance when the input argument is a 1D boolean array with stride value set to -1:
https://github.com/Vykstorm/numpy-bool-argmax-ext
For those cases, it uses a low-level C routine to find the index k
of an item with maximum value (True
), starting from the end to the beginning of the array a
.
Then you can compute argmax(a[::-1])
with len(a)-k-1
The low-level method doesn't perform any internal ndarray copies because it operates with the array a
which is already C-contiguous and aligned in memory. It also applies short-circuit
EDIT:
I extended the library to improve the performance argmax
also when dealing with stride values different than -1 (with 1D boolean arrays) with good results: a[::2]
, a[::-3]
, e.t.c.
Give it a try.
The problem is related with the memory alignment of the array when using strides.
Either a[1:-1]
, a[::-1]
are considered aligned in memory but a[::2]
dont:
a = np.random.randint(0,2,1000000,bool)
print(a[1:-1].flags.c_contiguous) # True
print(a[::-1].flags.c_contiguous) # False
print(a[::2].flags.c_contiguous) # False
This explains why np.argmax
is slow on a[::2]
(from documentation on ndarrays):
Several algorithms in NumPy work on arbitrarily strided arrays. However, some algorithms require single-segment arrays. When an irregularly strided array is passed in to such algorithms, a copy is automatically made.
np.argmax(a[::2])
is making a copy of the array. So if you do timeit(lambda: np.argmax(a[::2]), number=5000)
you are timing 5000 copies of the array a
Execute this and compare the results of this two timing calls:
print(timeit(lambda: np.argmax(a[::2]), number=5000))
b = a[::2].copy()
print(timeit(lambda: np.argmax(b), number=5000))
EDIT:
Digging into the source code in C of numpy, i found the underline implementation of argmax
function, PyArray_ArgMax which calls at some point to PyArray_ContiguousFromAny to ensure that the given input array is aligned in memory (C-style)
Then, if the dtype of the array is bool, it delegates to BOOL_argmax function. Looking at its code, seems that short-ciruit is always applied.
Summary
- In order to avoid copies by
np.argmax
, make sure that the input array is contiguous in memory - short-circuit is always applied when data type is boolean.