Select lines from text file which have ids listed in another file
I guess you meant grep -f, not grep -F, but you actually need a combination of both, plus -w:

grep -Fwf ids.csv table.csv

The reason you were getting false positives is (I'm guessing, since you did not explain) that if one ID can be contained within another, both lines will be printed. -w removes this problem and -F makes sure your patterns are treated as fixed strings, not regular expressions. From man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
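To see the substring problem that -w guards against, here is a tiny made-up example (invented contents for ids.csv and table.csv, purely for illustration):

printf '123\n' > ids.csv
printf '123\tfoo\n1234\tbar\n' > table.csv
grep -Ff ids.csv table.csv     # prints both rows: 123 matches inside 1234
grep -Fwf ids.csv table.csv    # prints only the row whose ID is exactly 123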
If your false positives are because an ID can be present in a non-ID field, loop through your file instead:
while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv
or, faster:
xargs -I {} grep -w "^{}" table.csv < ids.csv
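As a made-up illustration of that kind of false positive, suppose an ID value also shows up in a non-ID column (sample contents invented for demonstration):

printf '42\n' > ids.csv
printf '42\tfoo\n7\tweight=42\n' > table.csv
grep -Fwf ids.csv table.csv    # matches both rows: weight=42 is also a word match
while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv    # prints only the 42 row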
Personally, I would do this in perl though:
perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}}
print $_ if defined($k{$F[0]}); ' table.csv
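If the one-liner is hard to follow, here is the same command re-spaced with comments (identical logic; note that -a splits fields on whitespace, so this assumes the IDs contain no spaces):

perl -lane '
  BEGIN {
    open(A, "ids.csv");              # read the ID list once, up front
    while (<A>) { chomp; $k{$_}++ }  # remember each ID as a hash key
  }
  print $_ if defined($k{$F[0]});    # print rows whose first field is a known ID
' table.csv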
The join utility is what you want. It does require the input files to be lexically sorted.
Assuming your shell is bash or ksh:
join -t $'\t' <(sort ids.csv) <(sort table.csv)
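For instance, with made-up contents for ids.csv and table.csv, this prints only the table rows whose first field appears in ids.csv:

printf 'id2\nid1\n' > ids.csv
printf 'id1\taaa\nid3\tccc\nid2\tbbb\n' > table.csv
join -t $'\t' <(sort ids.csv) <(sort table.csv)
# id1  aaa
# id2  bbb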
Without needing to sort, the usual awk solution is:
awk -F '\t' 'NR==FNR {id[$1]; next} $1 in id' ids.csv table.csv
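For reference, the same awk program spread over multiple lines with comments (identical behavior):

awk -F '\t' '
  NR == FNR { id[$1]; next }  # first file (ids.csv): record each ID as an array key
  $1 in id                    # second file (table.csv): print rows whose first field is a known ID
' ids.csv table.csv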
The answers to this SO question helped me get around the niggles with join. Essentially, when you sort a file in preparation for sending it to join, you need to make sure you're sorting on the column you're joining on. So if that's the first one, you need to tell sort what the separator character is and that it should sort on the first field (and only the first field). Otherwise, if the first field has variable widths (for example), the separators and possibly other fields may start affecting the sort order.
So, use the -t option of sort to specify your separator character, and use the -k option to specify the field (remembering that you need a start and end field, even if they're the same, or it'll sort from that field to the end of the line).

For a tab-separated file like the one in this question, the following should work (with thanks to glenn's answer for the structure):
join -t$'\t' <(sort -d ids.csv) <(sort -d -t$'\t' -k1,1 table.csv) > output.csv
(For reference, the -d flag means dictionary sort. You might also want to use the -b flag to ignore leading whitespace; see man sort and man join.)
As a more general example, suppose you're joining two comma-separated files, input1.csv on the third column and input2.csv on the fourth. You could use:

join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv) > output.csv
Here the -1 and -2 options specify which fields to join on in the first and second input files respectively.
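As a quick sanity check with made-up data (note that join prints the join field first, followed by the remaining fields of each file):

printf 'a,b,K1\nc,d,K2\n' > input1.csv
printf 'w,x,y,K1\nw,x,y,K9\n' > input2.csv
join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv)
# K1,a,b,w,x,y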