Find repeated words in a text

With GNU grep:

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' | grep -Eo '(\b.+) \1\b'

Output:

twice twice
as as
here here
123 123

Options:

-E: Interpret (\b.+) \1\b as an extended regular expression.

-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
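
If the text spans several lines, adding -n alongside -o may help to locate each repeat, since every printed match is then prefixed with its line number. A minimal sketch, assuming GNU grep; the sample text and the function name locate_repeats are made up for illustration:

```shell
# Hypothetical sample text; the doubled words sit on lines 1 and 3.
sample='the quick quick brown
fox jumped over
the lazy lazy dog'

# Hypothetical helper: same regex as above, with -n added so that each
# match printed by -o is prefixed with "LINE:".
locate_repeats() {
    grep -Eon '(\b.+) \1\b'
}

printf '%s\n' "$sample" | locate_repeats
```

With this sample, the two lines printed are 1:quick quick and 3:lazy lazy.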

Regex:

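\b: Matches a zero-width word boundary, i.e. the position between a word character and a non-word character.

.+: Matches one or more characters of any kind, including spaces and punctuation.

\1: The parentheses () mark a capturing group, and \1 matches the same text that the first capturing group matched.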
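
Because .+ can also span spaces and punctuation, a stricter variant might restrict the group to word characters. This is only a sketch of that idea, not part of the command above: it assumes GNU grep (which supports \w, \b and back-references with -E), and find_repeats is a made-up name.

```shell
# Hypothetical helper: print each doubled word read from standard input.
# \w+ limits the captured text to letters, digits and underscores, so the
# group can never contain a space or punctuation.
find_repeats() {
    grep -Eo '\b(\w+) \1\b'
}

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' | find_repeats
```

For the sample line used above, this prints the same four pairs as the original command.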

Reference: The Stack Overflow Regular Expressions FAQ

It sounds like this is what you want (using any awk in any shell on every Unix box):

$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    head = prev = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}

$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back

$ awk -f tst.awk file
the quick  brown
fox jumped
 over the lazy
 dogs back

but please ask a new question with more truly representative sample input and expected output that shows punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences, and various other non-trivial cases.
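
As one illustration of why those cases matter, differences in capitalization could be handled by comparing the lowercased words. The following is only a sketch under that assumption, not a complete solution: it is the script above with tolower() added to the comparison, wrapped in a shell function whose name dedupe_words is made up.

```shell
# Hypothetical variant of tst.awk: treat "The the" as a repeated word by
# comparing the current and previous words case-insensitively.
dedupe_words() {
    awk '
    BEGIN { RS = ""; ORS = "\n\n" }   # paragraph mode: one record per paragraph
    {
        head = prev = ""
        tail = $0
        while ( match(tail, /[[:alpha:]]+/) ) {
            word = substr(tail, RSTART, RLENGTH)
            # Keep the text before the word; drop the word itself when it
            # equals the previous word, ignoring case.
            head = head substr(tail, 1, RSTART-1) (tolower(word) == tolower(prev) ? "" : word)
            tail = substr(tail, RSTART+RLENGTH)
            prev = word
        }
        print head tail
    }'
}

printf 'The the cat sat\nsat down\n' | dedupe_words
```

As in the original script, the whitespace around a removed word is left in place, and punctuation between two copies of a word does not stop the second copy from being removed.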


