How do I use grep to find lines in which any word occurs 3 times?

Using the standard word definition (a word is a run of letters, digits, and underscores):

GNU Grep, 3 or more occurrences of any word.

```
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file
```

GNU Grep, only 3 occurrences of any word.

```
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'
```
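To see how the two commands differ, here is a small sketch on made-up sample lines. The function names and the sample text are illustrative only, and GNU grep is assumed (for \w, \W, \<, \> and back-references in -E patterns).

```shell
# Sketch assuming GNU grep; function names and sample text are illustrative.
at_least_three() {
    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}'
}

exactly_three() {
    at_least_three | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'
}

sample() {
    printf '%s\n' \
        'the cat and the dog and the bird' \
        'one two one' \
        'go go go go'
}

sample | at_least_three   # keeps the "the ..." and "go ..." lines
sample | exactly_three    # keeps only the "the ..." line
```

Note that the second filter drops any line in which some word occurs four or more times, even if another word on that line occurs exactly three times.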

POSIX Awk, only 3 occurrences of any word.

```
awk -F '[^_[:alnum:]]+' '{           # Field separator is non-word sequences
    split("", cnt)                   # Delete array cnt
    for (i=1; i<=NF; i++) cnt[$i]++  # Count number of occurrences of each word
    for (i in cnt) {
        if (cnt[i]==3) {             # If a word appears exactly 3 times
            print                    # Print the line
            break
        }
    }
}' file
```

For 3 or more occurrences, simply change == to >=.

Equivalent golfed one-liner:

```
awk -F '[^_[:alnum:]]+' '{split("",c);for(i=1;i<=NF;i++)c[$i]++;for(i in c)if(c[i]==3){print;next;}}' file
```
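If the threshold needs to vary, one possible sketch of the POSIX Awk script above passes it in with -v instead of hard-coding 3. The function name at_least_n and the variable name min are made up for this illustration. Empty fields are skipped explicitly, because with a threshold of 1 or 2 the empty fields at the edges of a line could otherwise trigger a match.

```shell
# Sketch: same counting logic, with the threshold passed in as "min"
# (a hypothetical variable name) instead of being hard-coded.
at_least_n() {
    awk -v min="$1" -F '[^_[:alnum:]]+' '{
        split("", cnt)                        # reset the per-line counts
        for (i = 1; i <= NF; i++)
            if ($i != "") cnt[$i]++           # skip empty edge fields
        for (w in cnt)
            if (cnt[w] >= min) { print; next }
    }'
}

printf '%s\n' 'a b a b a' 'a b a b' 'x, x; x. x!' | at_least_n 3
```

With these sample lines, the first and third lines are printed: "a" occurs 3 times in the first and "x" occurs 4 times in the third.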

GNU Awk, only 3 occurrences of the word ab.
```
gawk 'gsub(/\<ab\>/,"&")==3' file
```
For 3 or more occurrences, simply change == to >=.
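Where gawk is not available, a rough portable sketch can count one specific word by splitting on non-word characters and comparing fields. This is a different technique from the gsub approach above; the function name word_exactly_n and the variables w and n are hypothetical names chosen for the example.

```shell
# Sketch: portable alternative to the gawk one-liner; "w" and "n" are
# hypothetical variable names for the word and the required count.
word_exactly_n() {
    awk -v w="$1" -v n="$2" -F '[^_[:alnum:]]+' '{
        hits = 0
        for (i = 1; i <= NF; i++)
            if (($i "") == (w "")) hits++     # force a string comparison
        if (hits == n) print
    }'
}

printf '%s\n' 'ab cd ab, ab' 'ab abc ab ab ab' 'abab ab ab' | word_exactly_n ab 3
```

Only the first sample line is printed: the second contains ab four times, and in the third abab is a different word.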

Reading material

\2 is a back-reference.
\w \W \< \> special expressions in GNU Grep.
The [:alnum:] POSIX character class.

Like this?

```
grep -E '(\<.+\>).+\<\1\>.+\<\1\>' file
```

grep -E (egrep is the older, now obsolescent spelling) enables extended regexes, so + and ( ) need no backslashes. Back-references inside an extended regex are a GNU extension rather than something POSIX requires.
\<.+\> matches text that starts and ends on a word boundary: at minimum a single word of one or more characters (because . matches any character, the captured text may also span several words)
- \< and \> match the start and the end of a word, respectively (your attempt didn't take word boundaries into account at all)
- .+ matches a sequence of one or more characters (your attempt used .*, which matches a sequence of zero or more characters!)
Back-references then check whether the captured sequence occurs a 2nd time (\1) and a 3rd time (\1 again).
- any sequence of one or more characters (.+) is allowed between the matches, so "foo bar foo dorbs foo godly" will match (there are 3 occurrences of the word "foo").
- if you only want to match adjacent words (e.g. "foo foo foo"), use something like [[:space:]]+ instead.
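The two variants described above can be compared side by side. This is a sketch assuming GNU grep; the variable names anywhere and adjacent and the sample lines are made up for the illustration.

```shell
# Sketch assuming GNU grep (\< \> and ERE back-references are GNU extensions).
anywhere='(\<.+\>).+\<\1\>.+\<\1\>'
adjacent='(\<.+\>)[[:space:]]+\<\1\>[[:space:]]+\<\1\>'

sample() {
    printf '%s\n' \
        'foo bar foo dorbs foo godly' \
        'foo foo foo' \
        'foo bar foo'
}

sample | grep -E "$anywhere"   # keeps the first two lines
sample | grep -E "$adjacent"   # keeps only "foo foo foo"
```

The third sample line contains "foo" only twice, so neither pattern keeps it.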

I assume your question means: if any word in the line occurs at least 3 times, print the line; otherwise discard it. I would use awk for a more readable and customizable solution:

```
awk -F '\\W+' '{
    delete c; for (i=1;i<=NF;i++) if (length($i) && ++c[$i]==3) {print; next}
}' file
```

It loops over all fields, counting their occurrences per line. As soon as any word reaches a count of 3, it prints the line and moves on to the next line, where delete c resets the counts. The length($i) test skips empty fields (for example, when a line starts with a separator), so they are never counted.

Here we can easily customize the meaning of "word" by choosing different field separators with -F, which accepts an extended regular expression. In the above, word separators are all characters except _ and [:alnum:]: awk -F '\\W+' (\W is supported by gawk but is not part of POSIX awk) or the portable awk -F '[^_[:alnum:]]+', similar to matching word boundaries with grep.

For a human language, we may need different word boundaries, for example everything except letters: awk -F '[^[:alpha:]]+', everything except letters and digits: awk -F '[^[:alnum:]]+', or a definition that allows the dash as well as the underscore inside words: awk -F '[^-_[:alnum:]]+'.

Without -F, fields are separated by runs of whitespace only.
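As a rough illustration of how the separator changes the result, the sketch below wraps the same loop in a function that takes the separator as its argument. The function name three_times is hypothetical, and split("", c) is used here in place of delete c as a reset that works in any POSIX awk.

```shell
# Sketch: the word definition is passed in as the field separator
# (first argument); split("", c) is used as a portable array reset.
three_times() {
    awk -F "$1" '{
        split("", c)
        for (i = 1; i <= NF; i++)
            if (length($i) && ++c[$i] == 3) { print; next }
    }'
}

line='x-a x-b x-c'
printf '%s\n' "$line" | three_times '[^_[:alnum:]]+'    # prints the line: "x" occurs 3 times
printf '%s\n' "$line" | three_times '[^-_[:alnum:]]+'   # prints nothing: x-a, x-b, x-c differ
```

When the dash counts as a word character, the three hyphenated tokens are distinct words, so the line no longer qualifies.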
