How to count the lines containing one of two words but not both
perl -nE 'END {say $c+0} ++$c if /\bthe\b/i xor /\ban\b/i' file
gawk 'END {print c+0} /\<the\>/ != /\<an\>/ {++c}' IGNORECASE=1 file
Comparing the results of the two matches gives the outcome you want. In awk, a match such as /\<the\>/ evaluates to either 0 or 1. If the other match yields the same value, then either both regexps matched or neither did, and the line should not be counted. If the values differ, exactly one of them matched, so the counter is incremented.
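To watch the comparison at work, here is a small sketch that prints both match results for a few made-up sample lines. It is only an illustration: so that it runs in any awk, it pads the lowercased line with spaces and tests for non-word neighbours instead of using gawk's \< and \>. That padding trick and the helper name show_matches are assumptions of this example, not part of the commands above.

```shell
# Illustrative helper (hypothetical name): print "<the?> <an?> <verdict>" per input line.
show_matches() {
    awk '{
        s = " " tolower($0) " "                # pad so each word has a neighbour on both sides
        t = (s ~ /[^a-z0-9_]the[^a-z0-9_]/)    # 1 if "the" occurs as a whole word, else 0
        a = (s ~ /[^a-z0-9_]an[^a-z0-9_]/)     # 1 if "an" occurs as a whole word, else 0
        print t, a, (t != a ? "counted" : "skipped")
    }'
}

printf '%s\n' 'The cat' 'an owl' 'the cat and an owl' 'no articles here' | show_matches
```

Only the lines where the two columns differ would increment the counter.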
gawk also has a built-in xor() function:
gawk 'END {print c+0} xor(/\<the\>/,/\<an\>/) {++c}' IGNORECASE=1 file
With grep:
cat poem.txt \
| grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
| grep -Eci -e '\<(an|the)\>'
This counts the matched lines. An alternative that counts the total number of matches instead is given under EDIT 2 below.
Breakdown:
The first grep command filters out all lines containing both 'an' and 'the'. The second grep command counts the remaining lines that contain either 'an' or 'the'.
If you remove the c from the second grep's -Eci, you will see the matching lines themselves (with the matches highlighted if color output is enabled).
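To make the two stages visible, here is a sketch that runs them separately on a throwaway file. The sample lines and the variable names are made up for illustration, and it assumes a grep that understands \< and \> (GNU grep does).

```shell
# Made-up sample file; the contents are an assumption for illustration.
sample=$(mktemp)
printf '%s\n' \
    'The cat sat' \
    'An owl and the cat' \
    'an apple' \
    'a pan of another kind' \
    'the end of an era' > "$sample"

# Stage 1: drop every line that contains both words (in either order).
stage1=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$sample")
printf '%s\n' "$stage1"

# Stage 2: count the surviving lines that contain either word.
count=$(printf '%s\n' "$stage1" | grep -Eci -e '\<(an|the)\>')
echo "$count"

rm -f "$sample"
```

Stage 1 keeps 'The cat sat', 'an apple' and 'a pan of another kind'; stage 2 then counts only the first two, because 'pan' and 'another' are not whole-word matches.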
Details:
- The -E option enables extended regular expression (ERE) syntax for grep.
- The -i option tells grep to match case-insensitively.
- The -v option tells grep to invert the result (i.e. match lines not containing the pattern).
- The -c option tells grep to output the number of matched lines instead of the lines themselves.

The patterns:
1. \< matches the beginning of a word (thanks @glenn-jackman).
2. \> matches the end of a word (thanks @glenn-jackman).
   --> That way we can make sure to not match words containing 'the' or 'an' (like 'pan').
3. grep -Evi -e '\<an\>.*\<the\>' thus matches all lines not containing 'an ... the'.
4. Similarly, grep -Evi -e '\<the\>.*\<an\>' matches all lines not containing 'the ... an'.
5. grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' is the combination of 3. and 4.
6. grep -Eci -e '\<(an|the)\>' matches all lines containing either 'an' or 'the' as a whole word and prints the number of matched lines.
EDIT 1: Use \< and \> instead of ( |^) and ( |$), as suggested by @glenn-jackman
EDIT 2: In order to count the number of matches instead of the number of matched lines, use the following expression:
cat poem.txt \
| grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
| grep -Eio -e '\<(an|the)\>' \
| wc -l
This uses grep's -o option, which prints every match on a separate line (and nothing else), and then wc -l to count those lines.
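The difference between the two counts only shows up when a line contains a word more than once. A minimal sketch on two made-up lines (the sample text and variable names are assumptions for illustration):

```shell
# Two made-up lines; the first contains "the" twice.
lines=$(printf '%s\n' 'the cat and the dog' 'an apple')

# -c counts matching lines.
line_count=$(printf '%s\n' "$lines" | grep -Eci -e '\<(an|the)\>')

# -o prints one match per line, so wc -l counts individual matches.
match_count=$(printf '%s\n' "$lines" | grep -Eio -e '\<(an|the)\>' | wc -l)

# $(( )) also strips the padding some wc implementations add.
echo "lines: $line_count, matches: $((match_count))"
```

Here the line count is 2 and the match count is 3.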
The following GNU awk program should do the trick:
awk '(/(^|\W)[Tt]he(\W|$)/ && !/(^|\W)[Aa]n(\W|$)/) || (/(^|\W)[Aa]n(\W|$)/ && !/(^|\W)[Tt]he(\W|$)/) {c++} END{print c}' poem.txt
This will increase the counter c if either
- the line matches (^|\W)[Tt]he(\W|$) (first-letter-case-insensitive the, preceded by a non-word constituent (\W) or beginning of line (^), and followed by a non-word constituent (\W) or end of line ($)) but not (^|\W)[Aa]n(\W|$) (the isolated first-letter-case-insensitive an)
- OR -
- the line matches (^|\W)[Aa]n(\W|$) but not (^|\W)[Tt]he(\W|$)

In the end, print the value of c.
It can be written slightly more compactly using \< and \> for "beginning-of-word" and "end-of-word":
awk '(/\<[Tt]he\>/ && !/\<[Aa]n\>/) || (/\<[Aa]n\>/ && !/\<[Tt]he\>/) {c++} END{print c}' poem.txt
Even shorter would be:
awk '/\<[Tt]he\>/ != /\<[Aa]n\>/ {c++} END{print c}' poem.txt
as the inequality is true only when exactly one of an and the (not both, and not neither) is present on a line.
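If you want to convince yourself that the short form is equivalent to the long one, this sketch enumerates all four cases. The variables t and a are stand-ins for the 0/1 results of the two regex matches; no real input is read.

```shell
table=$(awk 'BEGIN {
    # t and a stand for the 0/1 results of the two regex matches.
    # Columns: t, a, long form (t && !a) || (a && !t), short form t != a.
    for (t = 0; t <= 1; t++)
        for (a = 0; a <= 1; a++)
            print t, a, ((t && !a) || (a && !t)), (t != a)
}')
printf '%s\n' "$table"
```

The last two columns agree on every row, and they are 1 only on the two rows where t and a differ.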
This approach requires GNU awk because the \W and \< / \> constructs are GNU extensions to the extended regular expression syntax (although \< / \> are also understood by BSD regexes).
Notice that the pipeline construct you showed in your own attempted solution won't work: calling grep with a file as an input argument supersedes reading from stdin, so the first part of the pipeline simply vanishes unnoticed, and the output is entirely due to the last part (which looks for occurrences of an, even those embedded in other words).
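The stdin behaviour is easy to reproduce in isolation. In this sketch the file contents and the piped text are made up for illustration:

```shell
file=$(mktemp)
printf '%s\n' 'an entry from the file' > "$file"

# grep receives a file operand, so the text arriving on stdin is never read.
out=$(printf '%s\n' 'an entry from the pipe' | grep 'an' "$file")
printf '%s\n' "$out"    # prints: an entry from the file

rm -f "$file"
```

The piped line never shows up in the output, which is why chaining grep commands that each name the file does not combine their filters.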