combine two arrays and sort

Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).

Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.

Inserting all the elements into a hash table (such as a set) is something. That's going to take O(N) time for uniquify, but then you need to sort which will take O(n*log(n)). Still not great.

The real O(N) solution involves allocated an array and then populating it one element at a time by taking the smallest head of your input lists, ie. a merge. Unfortunately neither numpy nor Python seem to have such a thing. The solution may be to write one in Cython.

It would look vaguely like the following:

def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1, 
        numpy.ndarray[int, ndim=1] in2):

        cdef int i = 0
        cdef int j = 0
        cdef int k = 0
        while (i!=len(in1)) or (j!=len(in2)):
            # set out[k] to smaller of in[i] or in[j]
            # increment k
            # increment one of i or j

Since you use numpy, I doubt that bisec helps you at all... So instead I would suggest two smaller things:

  1. Do not use np.sort, use c.sort() method instead which sorts the array in place and avoids the copy.
  2. np.unique must use np.sort which is not in place. So instead of using np.unique do the logic by hand. IE. first sort (in-place) then do the np.unique method by hand (check also its python code), with flag = np.concatenate(([True], ar[1:] != ar[:-1])) with which unique = ar[flag] (with ar being sorted). To be a bit better, you should probably make the flag operation in place itself, ie. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]) which avoids basically one full copy of flag.
  3. I am not sure about this. But .sort has 3 different algorithms, since your arrays maybe are almost sorted already, changing the sorting method might make a speed difference.

This would make the full thing close to what you got (without doing a unique beforehand):

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b)) # we still need to do this unfortunatly.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

Tags:

Python

Numpy