Efficiently remove partial duplicates in a list of tuples
Let's conceptualize each tuple as a binary array, where 1 is "contains something" and 2 is "contains an empty string". Since the item at each position will be the same, we don't need to care what is at each position, only that something is.
l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent
Now, we can walk through that list and build a new datastructure without 'duplicates'. Since we have our tuples encoded as binary, we can determine a duplicate, 'encompassed' by another, by doing bitwise operations - given a
and b
, if a | b == a
, then a
must contain b
.
codes = {}
for tup, b in zip(l, l_bin):
# check if any existing code contains the potential new one
# in this case, skip adding the new one
if any(a | b == a for a in codes):
continue
# check if the new code contains a potential existing one or more
# in which case, replace the existing code(s) with the new code
for a in list(codes):
if b | a == b:
codes.pop(a)
# and finally, add this code to our datastructure
codes[b] = tup
Now we can withdraw our 'filtered' list of tuples:
output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]
Note that (A, B, C, '')
contains both (A, B, '', '')
and ('', B, '', '')
, and that (A, '', '', D')
contains ('', '', '', D)
, so this should be correct.
As of python 3.8, dict
preserves insertion order, so the output should be in the same order that the tuples originally appeared in the list.
This solution wouldn't be perfectly efficient, since the number of codes might stack up, but it should be between O(n) and O(n^2), depending on the number of unique codes left at the end (and since the length of each tuple is significantly less than the length of l
, it should be closer to O(n) than to O(n^2).
For that limit in particular, the obvious solution would be to convert each tuple to bit mask, accumulate them in a counter array, perform subset sum transformation, then filter the array l
.
See detailed code explanation in the comment.
Time complexity is obviously n + m * 2^m
, where n
is the number of tuples and m
is the length of each tuple. For n == 1000
and m == 10
, this is obviously faster than n^2
.
l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.
def tuple_to_mask(tuple_):
# convert the information whether each value in (tuple_) is empty to a bit mask
# (1 is empty, 0 is not empty)
return sum((value == '') << index for index, value in enumerate(tuple_))
count = [0] * (1 << len(l[0]))
for tuple_ in l:
# tuple_ is a tuple.
count[tuple_to_mask(tuple_)] += 1
# now count[mask] is the number of tuples in l with that mask
# transform the count array.
for dimension in range(len(l[0])):
for mask in range(len(count)):
if mask >> dimension & 1:
count[mask] += count[mask - (1 << dimension)]
# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)
filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)
I'm not sure if this is the most efficient or pythonic way, but this would be the straight-forward approach (again, maybe others will come with more sophisticated list-comprehension method):
take a look at this:
l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
def item_in_list(item, l):
for item2comp in l:
if item!=item2comp:
found = True
for part,rhs_part in zip(item, item2comp):
if part!='' and part!=rhs_part:
found = False
break
if found:
return True
return False
new_arr = []
for item in l:
if not item_in_list(item, l):
new_arr.append(item)
print(new_arr)
output:
[('A', 'B', 'C', ''), ('A', '', '', 'D')]
time complexity as I see it is - O((N**2)*M)
N - number of elements in list
M - number of parts in each element