What is the most efficient way to compute the difference of lines from two files?

To extend the comment of @L3viathan If order of element is not important set is the rigth way. here a dummy example you can adapt:

Click to copy

l1 = [0,1,2,3,4,5]
l2 = [3,4,5]
setL1 = set(l1)  # transform the list into a set
setL2 = set(l2)
setDiff = setl1 - setl2  # make the difference 
listeDiff = list(setDiff)  # if you want to have your element back in a list

as you see is pretty straightforward in python.

Try using sets:

Click to copy

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    set_b = set(f)

set_c = set_a - set_b

with open("list_c.txt","w") as f:
    for c in set_c:
        f.write(c)

The complexity of subtracting two sets is O(n) in the size of the set a.

you can create one set of the first file contents, then just use difference or symmetric_difference depending on what you call a difference

Click to copy

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    diffs = set_a.difference(f)

if list_b.txt contains more items than list_a.txt you want to swap them or use set_a.symmetric_difference(f) instead, depending on what you need.

difference(f) works but still has to construct a new set internally. Not a great performance gain (see set issubset performance difference depending on the argument type), but it's shorter.

In case order matters you can presort the lists together with item indices and then iterate over them together:

Click to copy

list_2 = sorted(list_2)
diff_idx = []
j = 0
for i, x in sorted(enumerate(list_1), key=lambda x: x[1]):
    if x != list_2[j]:
        diff_idx.append(i)
    else:
        j += 1
diff = [list_1[i] for i in sorted(diff_idx)]

This has time complexity of the sorting algorithm, i.e. O(n*log n).

What is the most efficient way to compute the difference of lines from two files?

Tags:

Python

Performance

List

Python 3.X

Difference

Related

Recent Posts