Find lines from a file which are not present in another file
The command you have to use is not diff but comm. Note that comm expects both input files to be sorted:
comm -23 a.txt b.txt
By default, comm outputs 3 columns: left-only, right-only, both. The -1, -2 and -3 switches suppress these columns.
So, -23 hides the right-only and both columns, showing the lines that appear only in the first (left) file.
If you want to find lines that appear in both, you can use -12, which hides the left-only and right-only columns, leaving you with just the both column.
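As a minimal sketch of the three switch combinations, the following creates two small, already sorted files in a temporary directory and runs each one. The file contents and the variable names (tmpdir, only_a, only_b, both) are made up for illustration.

```shell
# Hypothetical sample data; comm expects both inputs to be sorted.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/a.txt"
printf '%s\n' banana cherry date > "$tmpdir/b.txt"

# Lines only in a.txt (suppress columns 2 and 3).
only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Lines only in b.txt (suppress columns 1 and 3).
only_b=$(comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Lines present in both files (suppress columns 1 and 2).
both=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")

printf 'only in a.txt: %s\n' "$only_a"
printf 'only in b.txt: %s\n' "$only_b"
printf 'in both:\n%s\n' "$both"

rm -rf "$tmpdir"
```

With this sample data, apple is the only line unique to a.txt, date is the only line unique to b.txt, and banana and cherry are common to both.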
The simple answer did not work for me because I didn't realize comm matches line for line, so duplicate lines in one file will be printed as not-existing in the other. For example, if file1 contained:
Alex
Bill
Fred
And file2 contained:
Alex
Bill
Bill
Bill
Fred
Then comm -13 file1 file2 would output:
Bill
Bill
In my case, I wanted to know only that every string in file2 existed in file1, regardless of how many times that line occurred in each file.
Solution 1: use the -u (unique) flag to sort:
comm -13 <(sort -u file1) <(sort -u file2)
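The effect of deduplicating first can be sketched with the file1 and file2 contents from the example above. This version writes the sort -u output to temporary files rather than using process substitution, on the assumption that it should also run in shells without <(...) support. The temporary directory and the variable names (raw, deduped) are only for illustration.

```shell
# Recreate the example files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' Alex Bill Fred > "$tmpdir/file1"
printf '%s\n' Alex Bill Bill Bill Fred > "$tmpdir/file2"

# Plain comm pairs lines one-to-one, so the two extra "Bill" lines
# in file2 are reported as missing from file1.
raw=$(comm -13 "$tmpdir/file1" "$tmpdir/file2")

# Sort and deduplicate each file first, then compare the results.
sort -u "$tmpdir/file1" > "$tmpdir/file1.uniq"
sort -u "$tmpdir/file2" > "$tmpdir/file2.uniq"
deduped=$(comm -13 "$tmpdir/file1.uniq" "$tmpdir/file2.uniq")

printf 'without sort -u:\n%s\n' "$raw"
printf 'with sort -u: [%s]\n' "$deduped"

rm -rf "$tmpdir"
```

The second comparison prints nothing between the brackets, because every distinct line of file2 also occurs in file1.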
Solution 2: (the first "working" answer I found) from unix.stackexchange:
fgrep -v -f file1 file2
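Note that if file2 contains duplicate lines that don't exist at all in file1, fgrep will output each of the duplicate lines. fgrep (equivalent to grep -F) also treats each line of file1 as a substring to look for anywhere in a line of file2; add -x if only whole-line matches should count. Also note that my totally non-scientific tests on a single laptop for a single (fairly large) dataset showed Solution 1 (using comm) to be almost 5 times faster than Solution 2 (using fgrep).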
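One detail of the fixed-string approach worth checking on your own data is whether substring matches are acceptable. The sketch below uses grep -F (the equivalent of fgrep) with and without -x, which restricts matches to whole lines. The sample names and the variables (substring, whole_line) are invented for illustration and are not part of the original answer.

```shell
# Hypothetical data: "Billy" contains "Bill" as a substring.
tmpdir=$(mktemp -d)
printf '%s\n' Bill > "$tmpdir/file1"
printf '%s\n' Bill Billy Carl > "$tmpdir/file2"

# Without -x, any line of file2 that merely contains "Bill" is filtered out.
substring=$(grep -F -v -f "$tmpdir/file1" "$tmpdir/file2")

# With -x, only lines identical to a line of file1 are filtered out.
whole_line=$(grep -F -x -v -f "$tmpdir/file1" "$tmpdir/file2")

printf 'substring matching:\n%s\n' "$substring"
printf 'whole-line matching:\n%s\n' "$whole_line"

rm -rf "$tmpdir"
```

In this sample, the substring form reports only Carl, while the whole-line form also reports Billy, which is the result that matches what comm would consider a different line.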