What is the most efficient way to compute the difference of lines from two files?
To extend the comment of @L3viathan If order of element is not important set is the rigth way. here a dummy example you can adapt:
l1 = [0,1,2,3,4,5]
l2 = [3,4,5]
setL1 = set(l1) # transform the list into a set
setL2 = set(l2)
setDiff = setl1 - setl2 # make the difference
listeDiff = list(setDiff) # if you want to have your element back in a list
as you see is pretty straightforward in python.
Try using sets:
with open("list_a.txt") as f:
set_a = set(f)
with open("list_b.txt") as f:
set_b = set(f)
set_c = set_a - set_b
with open("list_c.txt","w") as f:
for c in set_c:
f.write(c)
The complexity of subtracting two sets is O(n) in the size of the set a.
you can create one set of the first file contents, then just use difference
or symmetric_difference
depending on what you call a difference
with open("list_a.txt") as f:
set_a = set(f)
with open("list_b.txt") as f:
diffs = set_a.difference(f)
if list_b.txt
contains more items than list_a.txt
you want to swap them or use set_a.symmetric_difference(f)
instead, depending on what you need.
difference(f)
works but still has to construct a new set
internally. Not a great performance gain (see set issubset performance difference depending on the argument type), but it's shorter.
In case order matters you can presort the lists together with item indices and then iterate over them together:
list_2 = sorted(list_2)
diff_idx = []
j = 0
for i, x in sorted(enumerate(list_1), key=lambda x: x[1]):
if x != list_2[j]:
diff_idx.append(i)
else:
j += 1
diff = [list_1[i] for i in sorted(diff_idx)]
This has time complexity of the sorting algorithm, i.e. O(n*log n).