Python equivalent of MATLAB's "ismember" function
Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:
def ismember(a, b):
bind = {}
for i, elt in enumerate(b):
if elt not in bind:
bind[elt] = i
return [bind.get(itm, None) for itm in a] # None can be replaced by any other "not in b" value
Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B))
. The above code requires one full scan of B to generate the dict Bset. By using a dict, you effectively make the lookup of each element in B constant for each element of A, making the operation O(len(A)+len(B))
. If this is still too slow, then worry about making the above function run on multiple cores.
Edit: I've also modified your indexing slightly. Matlab uses 0 because all of its arrays start at index 1. Python/numpy start arrays at 0, so if you're data set looks like this
A = [2378, 2378, 2378, 2378]
B = [2378, 2379]
and you return 0 for no element, then your results will exclude all elements of A. The above routine returns None
for no index instead of 0. Returning -1 is an option, but Python will interpret that to be the last element in the array. None
will raise an exception if it's used as an index into the array. If you'd like different behavior, change the second argument in the Bind.get(item,None)
expression to the value you want returned.
sfstewman's excellent answer most likely solved the issue for you.
I'd just like to add how you can achieve the same exclusively in numpy.
I make use of numpy's unique an in1d functions.
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
B_unique_sorted
contains the unique values inB
sorted.B_idx
holds for these values the indices into the originalB
.B_in_A_bool
is a boolean array the size ofB_unique_sorted
that stores whether a value inB_unique_sorted
is inA
.
Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect toB_idx
Note: I assume thatA
is already unique.
Now you can use B_in_A_bool
to either get the common vals
B_unique_sorted[B_in_A_bool]
and their respective indices in the original B
B_idx[B_in_A_bool]
Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.