Efficiently remove partial duplicates in a list of tuples

Let's conceptualize each tuple as a binary array, where 1 is "contains something" and 2 is "contains an empty string". Since the item at each position will be the same, we don't need to care what is at each position, only that something is.

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

Now, we can walk through that list and build a new datastructure without 'duplicates'. Since we have our tuples encoded as binary, we can determine a duplicate, 'encompassed' by another, by doing bitwise operations - given a and b, if a | b == a, then a must contain b.

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

Now we can withdraw our 'filtered' list of tuples:

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

Note that (A, B, C, '') contains both (A, B, '', '') and ('', B, '', ''), and that (A, '', '', D') contains ('', '', '', D), so this should be correct.

As of python 3.8, dict preserves insertion order, so the output should be in the same order that the tuples originally appeared in the list.

This solution wouldn't be perfectly efficient, since the number of codes might stack up, but it should be between O(n) and O(n^2), depending on the number of unique codes left at the end (and since the length of each tuple is significantly less than the length of l, it should be closer to O(n) than to O(n^2).

For that limit in particular, the obvious solution would be to convert each tuple to bit mask, accumulate them in a counter array, perform subset sum transformation, then filter the array l.

See detailed code explanation in the comment.

Time complexity is obviously n + m * 2^m, where n is the number of tuples and m is the length of each tuple. For n == 1000 and m == 10, this is obviously faster than n^2.

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)

I'm not sure if this is the most efficient or pythonic way, but this would be the straight-forward approach (again, maybe others will come with more sophisticated list-comprehension method):

take a look at this:

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):
    for item2comp in l:
        if item!=item2comp:
            found = True
            for part,rhs_part in zip(item, item2comp):
                if part!='' and part!=rhs_part:
                    found = False
                    break
            if found:
                return True
    return False
            
                
            
new_arr = []
for item in l:
    if not item_in_list(item, l):
        new_arr.append(item)
print(new_arr)

output:

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

time complexity as I see it is - O((N**2)*M)

N - number of elements in list

M - number of parts in each element

Efficiently remove partial duplicates in a list of tuples

Tags:

Python

Performance

List

Tuples

Related

Recent Posts