Removal of lines with more or fewer than 'N' fields?
You almost have it already:
awk -F'\t' 'NF==13 {print}' infile > newfile
And, if you're on one of those systems where you're charged by the keystroke ( :) ) you can shorten that to
awk -F'\t' 'NF==13' infile > newfile
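As a quick sanity check of the one-liner, here is a throwaway sample (the file name sample.txt is just for illustration) with one 13-field line surrounded by lines with other field counts; only the 13-field line survives the filter:

```shell
# Build a 3-line sample: only the second line has exactly 13 tab-separated fields.
printf 'a\tb\n' > sample.txt
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\n' >> sample.txt
printf 'x\ty\tz\n' >> sample.txt

# Keep only the lines with exactly 13 fields (12 tabs).
awk -F'\t' 'NF==13' sample.txt
```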
To do multiple files in one sweep,
and to actually change the files (and not just create new files),
identify a filename that's not in use (for example, scharf), and perform a loop, like this:
for f in list
do
    awk -F'\t' 'NF==13 {print}' "$f" > scharf && mv -f -- scharf "$f"
done
The list can be one or more filenames and/or wildcard filename expansion patterns; for example,
for f in blue.data green.data *.dat orange.data red.data /ultra/violet.dat
The mv command overwrites the input file (e.g., blue.data) with the temporary scharf file (which has only the lines from the input file with 13 fields).
(Be sure this is what you want to do, and be careful.
To be safe, you should probably back up your data first.)
The -f tells mv to overwrite the input file, even though it already exists. The -- protects you against weirdness if any of your files has a name beginning with -.
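Putting the loop together end to end, here is a self-contained sketch (the demo data is made up; blue.data and green.data are created fresh, so nothing real is overwritten):

```shell
# Create two small demo files; only one line in each has exactly 13 fields.
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\nshort\tline\n' > blue.data
printf 'too\tfew\n1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\n' > green.data

# Filter each file "in place" via the scratch file scharf.
for f in blue.data green.data
do
    awk -F'\t' 'NF==13 {print}' "$f" > scharf && mv -f -- scharf "$f"
done

wc -l blue.data green.data   # each file should now contain a single line
```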
Since this is a large file, it may be worth using a slightly more complex tool for a performance gain. Usually, specialized tools are faster than generalist tools. For example, solving the same problem with cut tends to be faster than grep, which tends to be faster than sed, which tends to be faster than awk (the flip side being that later tools can do things that earlier ones can't).
You want to remove lines with 13 tab characters or more, so:
LC_ALL=C grep -Ev '(␉.*){13}'
or maybe (I don't expect a measurable performance difference)
LC_ALL=C grep -Ev '(␉.*){12}␉'
where ␉ is a literal tab character. Setting the locale to C isn't necessary, but speeds up some versions of GNU grep compared with multibyte locales.
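Typing a literal tab into a command line can be awkward, so one portable trick is to capture it in a variable with printf. A minimal sketch (sample.txt is throwaway demo data): the first line has 12 tabs and is kept, the second has 13 tabs and is dropped:

```shell
# Capture a literal tab character portably.
tab=$(printf '\t')

# Sample: line 1 has 13 fields (12 tabs, kept); line 2 has 14 fields (13 tabs, removed).
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\n' > sample.txt
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\t14\n' >> sample.txt

# Drop every line containing 13 or more tabs.
LC_ALL=C grep -Ev "(${tab}.*){13}" sample.txt
```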