How do I use grep to find lines, in which any word occurs 3 times?
Using the standard word definition (a sequence of letters, digits and underscores),
GNU Grep, 3 or more occurrences of any word.
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file
GNU Grep, only 3 occurrences of any word.
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'
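As a quick check of the two grep pipelines above, here is a minimal sketch run against a few made-up lines. The sample text and the variable names are assumptions for illustration only. It assumes GNU grep, since \w, \W, \<, \> and back-references in -E patterns are GNU extensions.

```shell
# Hypothetical sample input: "one" occurs 3 times, 2 times, 4 times and 0 times.
sample='one two one three one
one two one two
one one one one
a b c'

# 3 or more occurrences of any word.
three_or_more=$(printf '%s\n' "$sample" |
    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}')

# Only 3: additionally drop every line in which some word occurs 4 or more times.
only_three=$(printf '%s\n' "$three_or_more" |
    grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}')

printf 'three or more:\n%s\n' "$three_or_more"
printf 'only three:\n%s\n' "$only_three"
```

Note that the second filter rejects a whole line as soon as any word on it occurs 4 or more times, even if a different word on the same line occurs exactly 3 times.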
POSIX Awk, only 3 occurrences of any word.
awk -F '[^_[:alnum:]]+' '{               # Field separator is non-word sequences
    split("", cnt)                       # Delete array cnt
    for (i=1; i<=NF; i++) cnt[$i]++      # Count number of occurrences of each word
    for (i in cnt) {
        if (cnt[i]==3) {                 # If a word appears exactly 3 times
            print                        # Print the line
            break
        }
    }
}' file
For 3 or more occurrences, simply change == to >=.
Equivalent golfed one-liner:
awk -F '[^_[:alnum:]]+' '{split("",c);for(i=1;i<=NF;i++)c[$i]++;for(i in c)if(c[i]==3){print;next;}}' file
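To see the difference between == and >= in practice, the following sketch wraps the awk program in a small shell function. The function name lines_with_repeats, the exact variable passed with -v, and the sample lines are all hypothetical; the counting logic is the one from the answer.

```shell
# lines_with_repeats is a hypothetical helper: $1 = 1 means "exactly 3",
# $1 = 0 means "3 or more". Input is read from stdin.
lines_with_repeats() {
    awk -F '[^_[:alnum:]]+' -v exact="$1" '{
        split("", cnt)                        # clear the per-line counts
        for (i = 1; i <= NF; i++) cnt[$i]++   # count each word on this line
        for (w in cnt)
            if (exact ? (cnt[w] == 3) : (cnt[w] >= 3)) { print; next }
    }'
}

# Made-up sample input.
sample='one two one three one
one one one one
x_1 y x_1, x_1!
a b c'

exactly_three=$(printf '%s\n' "$sample" | lines_with_repeats 1)
three_or_more=$(printf '%s\n' "$sample" | lines_with_repeats 0)

printf 'exactly three:\n%s\n' "$exactly_three"
printf 'three or more:\n%s\n' "$three_or_more"
```

The line with four occurrences of "one" only shows up in the second result, and x_1 is counted as a single word because the underscore is part of the word definition.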
GNU Awk, only 3 occurrences of the word ab.
gawk 'gsub(/\<ab\>/,"&")==3' file
For 3 or more occurrences, simply change == to >=.
Reading material
- \2 is a back-reference.
- \w, \W, \<, \> special expressions in GNU Grep.
- The [:alnum:] POSIX character class.
Like this?
egrep '(\<.+\>).+\<\1\>.+\<\1\>'
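- egrep (or grep -E; recent GNU grep releases flag egrep as obsolescent) enables extended regexes, so the group can be written with plain parentheses. Back-references themselves do not require extended regexes: basic regexes have them too, with the group written as \(...\), and in extended regexes they are a GNU extension.
- \<.+\> will match any word of at least 1 character
  - \< resp. \> match word boundaries (in your attempt you didn't take word boundaries into account at all)
  - .+ matches a sequence of one or more characters (in your attempt you used .* which matches a sequence of zero or more characters!). Since . also matches spaces, the group can span several words; use \w+ in place of .+ to restrict it to a single word.
- use back-references to check whether the matched sequence occurs a 2nd time (\1) and a 3rd time (\1 again).
- we allow any sequence of one or more characters (.+) between the matches, so "foo bar foo dorbs foo godly" will match (there are 3 occurrences of the word "foo").
  - if you only want to match adjacent words (e.g. "foo foo foo"), use something like [[:space:]]+ instead.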
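The difference between .+ and [[:space:]]+ as the filler between matches can be tried out with a short sketch like the one below. The sample lines and variable names are made up for illustration, grep -E is used in place of egrep, and GNU grep is assumed because of \<, \> and the back-references.

```shell
# Made-up sample input.
sample='foo bar foo dorbs foo godly
foo foo foo
foo bar foo'

# Any characters may sit between the three occurrences.
anywhere=$(printf '%s\n' "$sample" |
    grep -E '(\<.+\>).+\<\1\>.+\<\1\>')

# Only whitespace may sit between the three occurrences (adjacent repeats).
adjacent=$(printf '%s\n' "$sample" |
    grep -E '(\<.+\>)[[:space:]]+\<\1\>[[:space:]]+\<\1\>')

printf 'anywhere:\n%s\n' "$anywhere"
printf 'adjacent:\n%s\n' "$adjacent"
```

The third sample line contains "foo" only twice, so neither pattern selects it.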
I assume that your question means: if any word in the line occurs at least 3 times, print the line, else discard it. I would use awk for a more readable and customizable solution:
awk -F '\\W+' '{
delete c; for (i=1;i<=NF;i++) if (length($i) && ++c[$i]==3) {print; next}
}' file
It loops over all fields, counting the occurrences of each word per line. As soon as any word reaches 3 occurrences, it prints the line and goes to the next line; the array is cleared at the start of every line. The test on the length of the field is there so that empty fields are not counted.
Here we can easily customize the meaning of "word" by choosing a different field separator with -F, which awk interprets as an extended regular expression. In the above, word separators are all characters except _ and [:alnum:]: awk -F '\\W+' or awk -F '[^_[:alnum:]]+', similar to matching word boundaries with grep. Note that \W is a GNU Awk extension; the bracket expression is the portable form.
For a human language, we may need different word boundaries, such as everything except letters: awk -F '[^[:alpha:]]+', or everything except letters and digits: awk -F '[^[:alnum:]]+', or to make not only the underscore but also the dash part of a word: awk -F '[^-_[:alnum:]]+'.
Without setting -F, only the whitespace characters are used.
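How much the choice of separator changes the result can be seen with a small experiment. This is only a sketch: the helper name repeated3 and the sample lines are hypothetical, and the portable bracket expressions are used instead of \W so that it does not depend on GNU Awk.

```shell
# repeated3 is a hypothetical helper: $1 is the field-separator regex for -F.
# It prints every input line in which some non-empty field occurs 3 times.
repeated3() {
    awk -F "$1" '{
        delete c                               # reset counts for this line
        for (i = 1; i <= NF; i++)
            if (length($i) && ++c[$i] == 3) { print; next }
    }'
}

# Made-up sample input.
sample='foo-bar foo-baz foo qux
up-to-date docs, up-to-date code, up-to-date tests
x1 x2 x3 y'

word_chars=$(printf '%s\n' "$sample" | repeated3 '[^_[:alnum:]]+')
with_dash=$(printf '%s\n' "$sample" | repeated3 '[^-_[:alnum:]]+')
letters_only=$(printf '%s\n' "$sample" | repeated3 '[^[:alpha:]]+')

printf 'letters, digits, underscore:\n%s\n' "$word_chars"
printf 'dash also part of a word:\n%s\n' "$with_dash"
printf 'letters only:\n%s\n' "$letters_only"
```

With the dash counted as part of a word, "foo-bar" and "foo-baz" no longer contribute to the count of "foo"; with letters only, "x1", "x2" and "x3" all collapse to the word "x".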