combine two arrays and sort
Inserting elements into the middle of an array
is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect
. The complexity of doing so would be around O(N^2)
.
Your current approach is O(n*log(n))
, so that's a lot better, but it's not perfect.
Inserting all the elements into a hash table (such as a set
) is something. That's going to take O(N)
time for uniquify, but then you need to sort which will take O(n*log(n))
. Still not great.
The real O(N)
solution involves allocated an array and then populating it one element at a time by taking the smallest head of your input lists, ie. a merge. Unfortunately neither numpy
nor Python seem to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
numpy.ndarray[int, ndim=1] in1,
numpy.ndarray[int, ndim=1] in2):
cdef int i = 0
cdef int j = 0
cdef int k = 0
while (i!=len(in1)) or (j!=len(in2)):
# set out[k] to smaller of in[i] or in[j]
# increment k
# increment one of i or j
Since you use numpy, I doubt that bisec helps you at all... So instead I would suggest two smaller things:
- Do not use
np.sort
, usec.sort()
method instead which sorts the array in place and avoids the copy. np.unique
must usenp.sort
which is not in place. So instead of usingnp.unique
do the logic by hand. IE. first sort (in-place) then do thenp.unique
method by hand (check also its python code), withflag = np.concatenate(([True], ar[1:] != ar[:-1]))
with whichunique = ar[flag]
(with ar being sorted). To be a bit better, you should probably make the flag operation in place itself, ie.flag = np.ones(len(ar), dtype=bool)
and thennp.not_equal(ar[1:], ar[:-1], out=flag[1:])
which avoids basically one full copy offlag
.- I am not sure about this. But
.sort
has 3 different algorithms, since your arrays maybe are almost sorted already, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
def insort(a, b, kind='mergesort'):
# took mergesort as it seemed a tiny bit faster for my sorted large array try.
c = np.concatenate((a, b)) # we still need to do this unfortunatly.
c.sort(kind=kind)
flag = np.ones(len(c), dtype=bool)
np.not_equal(c[1:], c[:-1], out=flag[1:])
return c[flag]