Is there a tool to get the lines in one file that are not in another?
Yes. The standard grep tool for searching files for text strings can be used to subtract all the lines in one file from another.
grep -F -x -v -f fileB fileA
This works by using each line in fileB as a pattern (-f fileB) and treating it as a plain string to match, not a regular expression (-F). You force the match to happen on the whole line (-x) and print out only the lines that don't match (-v). Therefore you are printing out the lines in fileA that are not identical to any line in fileB.
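As a rough illustration of why -F and -x matter, here is a small sketch using made-up sample data in a scratch directory (the file contents and variable names are assumptions, not part of the original question):

```shell
# Scratch directory with hypothetical sample files
tmp=$(mktemp -d)
printf '%s\n' apple abc banana cherry > "$tmp/fileA"
printf '%s\n' banana a.c app > "$tmp/fileB"

# Strict form: literal strings (-F), whole-line match (-x), inverted (-v)
strict=$(grep -F -x -v -f "$tmp/fileB" "$tmp/fileA")
printf 'strict:\n%s\n' "$strict"

# Loose form for contrast: "a.c" is now a regex that matches "abc",
# and "app" matches inside "apple", so both lines are wrongly dropped
loose=$(grep -v -f "$tmp/fileB" "$tmp/fileA")
printf 'loose:\n%s\n' "$loose"

rm -rf "$tmp"
```

With this data the strict form keeps apple, abc and cherry, while the loose form keeps only cherry.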
The downside of this solution is that it doesn't take line order into account, so if your input has duplicate lines in different places you might not get what you expect. The solution to that is to use a real comparison tool such as diff. You could do this by creating a diff file with the context value set to the full number of lines in the file, then parsing it for just the lines that would be removed if converting fileA to fileB. (Note this command also skips the two diff header lines, which would otherwise match too, and strips the diff formatting after it gets the right lines.)
diff -U $(wc -l < fileA) fileA fileB | sed -n '3,$ s/^-//p' > fileC
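To make the duplicate-line difference concrete, here is a minimal sketch with invented data where fileA holds an extra copy of a line that fileB also contains. The sed address `3,$` is used so that diff's two header lines (the first of which starts with `---`) are not mistaken for removed lines; the sample contents and variable names are assumptions.

```shell
# Scratch directory with hypothetical sample files
tmp=$(mktemp -d)
printf '%s\n' a a b > "$tmp/fileA"
printf '%s\n' a b > "$tmp/fileB"

# grep treats fileB as a set of strings, so both copies of "a" are filtered out.
# grep exits with status 1 when it selects no lines, hence "|| true".
by_grep=$(grep -F -x -v -f "$tmp/fileB" "$tmp/fileA" || true)

# diff pairs lines up in order, so the one unmatched "a" is reported as removed.
by_diff=$(diff -U $(wc -l < "$tmp/fileA") "$tmp/fileA" "$tmp/fileB" | sed -n '3,$ s/^-//p')

printf 'grep: [%s]\ndiff: [%s]\n' "$by_grep" "$by_diff"
rm -rf "$tmp"
```

Here grep reports nothing, while the diff pipeline reports the single surplus a.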
The answer depends a great deal on the type and format of the files you are comparing.
If the files you are comparing are sorted text files, then comm, the GNU tool written by Richard Stallman and David MacKenzie, may perform the filtering you are after. It is part of GNU coreutils.
Example
Say you have the following 2 files:
$ cat a
1
2
3
4
5
$ cat b
1
2
3
4
5
6
Lines in file b that are not in file a:
$ comm -13 <(sort a) <(sort b)
6
from stackoverflow...
comm -23 file1 file2
-23 suppresses the lines unique to file2 (-2) and the lines that appear in both (-3), leaving only the lines unique to file1. The files have to be sorted (they are in your example); if they are not, pipe them through sort first.
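For unsorted input, one possible way to do the sorting step is sketched below with invented file contents. It writes sorted copies to a scratch directory rather than using process substitution, so it should also run in a plain POSIX sh; LC_ALL=C is set so that sort and comm agree on the collation order.

```shell
# Use one collation order for both sort and comm
export LC_ALL=C

# Scratch directory with hypothetical unsorted sample files
tmp=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmp/file1"
printf '%s\n' fig kiwi > "$tmp/file2"

# comm expects sorted input, so sort each file first
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# -23 drops column 2 (only in file2) and column 3 (in both)
only_in_1=$(comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$only_in_1"

rm -rf "$tmp"
```

This prints apple and pear, the two lines that appear only in file1.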
From the man page:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
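The column layout is easier to see on a tiny example. The sketch below uses two invented, already sorted files; in comm's output, column 2 is indented by one tab and column 3 by two, and suppressing a column also removes its indentation.

```shell
# Use a fixed collation order so the numeric strings compare predictably
export LC_ALL=C

# Scratch directory with hypothetical, already sorted sample files
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 > "$tmp/x"
printf '%s\n' 2 3 4 > "$tmp/y"

# No options: three tab-indented columns
all=$(comm "$tmp/x" "$tmp/y")

# -12: only lines common to both files
both=$(comm -12 "$tmp/x" "$tmp/y")

# -3: lines unique to either file; those from y keep one leading tab
either=$(comm -3 "$tmp/x" "$tmp/y")

printf 'all:\n%s\nboth:\n%s\neither:\n%s\n' "$all" "$both" "$either"
rm -rf "$tmp"
```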