Keep only the lines containing an exact number of delimiters
Another POSIX one:
awk -F , 'NF == 11' <file
If the line has 10 commas, then there will be 11 fields in this line. So we simply make awk use , as the field delimiter. If the number of fields is 11, the condition NF == 11 is true, and awk then performs the default action, print $0.
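As a minimal sketch of the same idea with the comma count made adjustable: the function name keep_n_commas and the sample input are made up for illustration, and the count is passed to awk with -v.

```shell
# Hypothetical helper: keep only lines with exactly $1 commas,
# i.e. lines that awk splits into $1 + 1 comma-separated fields.
keep_n_commas() {
    awk -F , -v n="$1" 'NF == n + 1'
}

# Sample input with 1, 2 and 3 commas; only the 2-comma line is kept.
printf '%s\n' 'a,b' 'a,b,c' 'a,b,c,d' | keep_n_commas 2
```

For the case discussed here, keep_n_commas 10 <file should behave like the one-liner above.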
Using egrep (or grep -E in POSIX):
egrep "^([^,]*,){10}[^,]*$" file.csv
This filters out anything not containing 10 commas: it matches full lines (^ at the start and $ at the end), containing exactly ten repetitions ({10}) of the sequence "any number of characters except ',', followed by a single ','" (([^,]*,)), followed again by any number of characters except ',' ([^,]*).
You can also use the -x option to drop the anchors:
grep -xE "([^,]*,){10}[^,]*" file.csv
This is less efficient than cuonglm's awk solution though; the latter is typically six times faster on my system for lines with around 10 commas. Longer lines will cause huge slowdowns.
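To check this on your own system, something like the following sketch could be used. It builds a throwaway test file (the data, the file size and the use of mktemp are assumptions for illustration, not part of the original measurement) and confirms that both commands select the same number of lines.

```shell
# Build a hypothetical test file: 1000 lines, alternating between
# 11 commas (even i) and 10 commas (odd i).
tmp=$(mktemp)
awk 'BEGIN {
    for (i = 0; i < 1000; i++)
        print (i % 2 ? "a,b,c,d,e,f,g,h,i,j,k" : "a,b,c,d,e,f,g,h,i,j,k,l")
}' > "$tmp"

# Count the lines each approach keeps; the two counts should agree.
awk_count=$(awk -F , 'NF == 11 { c++ } END { print c + 0 }' "$tmp")
grep_count=$(grep -cxE '([^,]*,){10}[^,]*' "$tmp")

printf 'awk kept %s lines, grep kept %s lines\n' "$awk_count" "$grep_count"
rm -f "$tmp"
```

Running each command under time on a much larger file should give a rough idea of the speed difference; the exact ratio will depend on the awk and grep implementations and on the line length.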
The simplest grep code that will work:
grep -xE '([^,]*,){10}[^,]*'
Explanation:
-x ensures that the pattern must match the entire line, rather than just part of it. This is important so you don't match lines with more than 10 commas.
-E means "extended regex", which makes for less backslash-escaping in your regex.
Parentheses are used for grouping, and the {10} afterwards means there must be exactly ten matches in a row of the pattern within the parentheses.
[^,] is a character class. For instance, [c-f] would match any single character that is a c, a d, an e or an f, and [^A-Z] would match any single character that is NOT an uppercase letter. So [^,] matches any single character except a comma.
The * after the character class means "zero or more of these."
So the regex part ([^,]*,) means "any character except a comma, any number of times (including zero times), followed by a comma", and the {10} specifies 10 of these. Then [^,]* matches the rest of the non-comma characters to the end of the line.
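To see why -x matters, here is a small sketch on made-up sample data (three lines with 9, 10 and 11 commas). The variable names are only for illustration.

```shell
# Hypothetical sample: lines with 9, 10 and 11 commas respectively.
sample=$(printf '%s\n' \
    '1,2,3,4,5,6,7,8,9,10' \
    '1,2,3,4,5,6,7,8,9,10,11' \
    '1,2,3,4,5,6,7,8,9,10,11,12')

# With -x the pattern must cover the whole line: only the 10-comma line matches.
with_x=$(printf '%s\n' "$sample" | grep -xE '([^,]*,){10}[^,]*')

# Without -x the pattern may match just part of a line,
# so the 11-comma line is selected as well.
without_x=$(printf '%s\n' "$sample" | grep -E '([^,]*,){10}[^,]*')

printf 'with -x:\n%s\n\nwithout -x:\n%s\n' "$with_x" "$without_x"
```

The 9-comma line is rejected either way, since it cannot supply ten comma-terminated groups.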