Select lines from text file which have ids listed in another file
I guess you meant grep -f, not grep -F, but you actually need a combination of both, plus -w:

grep -Fwf ids.csv table.csv

The reason you were getting false positives is (I'm guessing, since you did not explain) that if one ID can be contained within another, both lines will be printed. -w removes this problem and -F makes sure your patterns are treated as fixed strings, not regular expressions. From man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
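To see the substring problem that -w guards against, here is a tiny made-up example (invented contents for ids.csv and table.csv, purely for illustration):

printf '123\n' > ids.csv
printf '123\tfoo\n1234\tbar\n' > table.csv
grep -Ff ids.csv table.csv     # prints both rows: 123 matches inside 1234
grep -Fwf ids.csv table.csv    # prints only the row whose ID is exactly 123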
If your false positives are because an ID can be present in a non-ID field, loop through your file instead:
while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv
or, faster:
xargs -I {} grep -w "^{}" table.csv < ids.csv
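As a made-up illustration of that kind of false positive, suppose an ID value also shows up in a non-ID column (sample contents invented for demonstration):

printf '42\n' > ids.csv
printf '42\tfoo\n7\tweight=42\n' > table.csv
grep -Fwf ids.csv table.csv    # matches both rows: weight=42 is also a word match
while IFS= read -r pat; do grep -w "^$pat" table.csv; done < ids.csv    # prints only the 42 row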
Personally, I would do this in perl though:
perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}}
print $_ if defined($k{$F[0]}); ' table.csv
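If the one-liner is hard to follow, here is the same command re-spaced with comments (identical logic; note that -a splits fields on whitespace, so this assumes the IDs contain no spaces):

perl -lane '
  BEGIN {
    open(A, "ids.csv");              # read the ID list once, up front
    while (<A>) { chomp; $k{$_}++ }  # remember each ID as a hash key
  }
  print $_ if defined($k{$F[0]});    # print rows whose first field is a known ID
' table.csv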
The join utility is what you want. It does require the input files to be lexically sorted.
Assuming your shell is bash or ksh:
join -t $'\t' <(sort ids.csv) <(sort table.csv)
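For instance, with made-up contents for ids.csv and table.csv, this prints only the table rows whose first field appears in ids.csv:

printf 'id2\nid1\n' > ids.csv
printf 'id1\taaa\nid3\tccc\nid2\tbbb\n' > table.csv
join -t $'\t' <(sort ids.csv) <(sort table.csv)
# id1  aaa
# id2  bbb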
Without needing to sort, the usual awk solution is:
awk -F '\t' 'NR==FNR {id[$1]; next} $1 in id' ids.csv table.csv
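For reference, the same awk program spread over multiple lines with comments (identical behavior):

awk -F '\t' '
  NR == FNR { id[$1]; next }  # first file (ids.csv): record each ID as an array key
  $1 in id                    # second file (table.csv): print rows whose first field is a known ID
' ids.csv table.csv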
The answers to this SO question helped me get around the niggles with join. Essentially, when you sort a file in preparation for sending it to join, you need to make sure you're sorting on the column you're joining on. So if that's the first one, you need to tell sort what the separator character is and that it should sort on the first field (and only the first field). Otherwise, if the first field has variable widths (for example), the separators and possibly other fields may start affecting the sort order.
So, use the -t option of sort to specify your separator character, and use the -k option to specify the field (remembering that you need a start and end field, even if they're the same, or it'll sort from that field to the end of the line).

For a tab-separated file like the one in this question, the following should work (with thanks to glenn's answer for the structure):
join -t$'\t' <(sort -d ids.csv) <(sort -d -t$'\t' -k1,1 table.csv) > output.csv
(For reference, the -d flag means dictionary sort. You might also want to use the -b flag to ignore leading whitespace; see man sort and man join.)
As a more general example, suppose you're joining two comma-separated files, input1.csv on the third column and input2.csv on the fourth. You could use:

join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv) > output.csv
Here the -1 and -2 options specify which fields to join on in the first and second input files respectively.
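As a quick sanity check with made-up data (note that join prints the join field first, followed by the remaining fields of each file):

printf 'a,b,K1\nc,d,K2\n' > input1.csv
printf 'w,x,y,K1\nw,x,y,K9\n' > input2.csv
join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv)
# K1,a,b,w,x,y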