Find repeated words in a text
With GNU grep:
echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' | grep -Eo '(\b.+) \1\b'
Output:
twice twice
as as
here here
123 123
Options:
-E: Interpret (\b.+) \1\b as an extended regular expression.
-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Regex:
\b: A zero-width word boundary.
.+: Matches one or more characters of any kind.
\1: The parentheses () mark a capturing group, and \1 is a backreference that matches the same text the first capturing group matched.
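Note that .+ matches any characters, spaces included, so the captured group can span several words. The following is a minimal sketch contrasting the pattern above with a stricter variant that limits the group to a single word via \w+. The function names are hypothetical, and \w (like \b and backreferences in -E patterns) is assumed to be available as it is in GNU grep.

```shell
# Pattern from the answer: the group may span several words.
find_repeats_loose() { grep -Eo '(\b.+) \1\b'; }

# Hypothetical stricter variant: the group is limited to word characters,
# so only a single repeated word can match.
find_repeats_strict() { grep -Eo '\b(\w+) \1\b'; }

# Prints "one two one two" and then "and and".
echo 'one two one two and and' | find_repeats_loose

# Prints only "and and".
echo 'one two one two and and' | find_repeats_strict
```

Which variant is appropriate depends on whether repeated phrases should be reported as well as repeated single words.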
Reference: The Stack Overflow Regular Expressions FAQ
It sounds like something like this is what you want (using any awk in any shell on every UNIX box):
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    head = prev = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}
$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back
$ awk -f tst.awk file
the quick brown
fox jumped
over the lazy
dogs back
but please ask a new question with more truly representative sample input and expected output that cover punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences, and various other non-trivial cases.
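As one illustration of the capitalization case, here is a minimal sketch, assuming that words differing only in case should count as duplicates. That assumption and the function name dedupe_words_ci are not part of the original script; the only change to the awk logic is wrapping both sides of the comparison in tolower().

```shell
# Hypothetical variant of tst.awk: compare adjacent words case-insensitively.
dedupe_words_ci() {
    awk '
    BEGIN { RS = ""; ORS = "\n\n" }
    {
        head = prev = ""
        tail = $0
        while ( match(tail, /[[:alpha:]]+/) ) {
            word = substr(tail, RSTART, RLENGTH)
            # Drop the word when it repeats the previous one, ignoring case.
            head = head substr(tail, 1, RSTART-1) (tolower(word) == tolower(prev) ? "" : word)
            tail = substr(tail, RSTART+RLENGTH)
            prev = word
        }
        print head tail
    }'
}

# "the" and "Quick" are dropped as repeats of "The" and "quick".
printf 'The the quick Quick brown\n' | dedupe_words_ci
```

The text between words is copied through unchanged, so the separator that preceded a dropped word stays in the output; squeezing the resulting doubled spaces (for example with tr -s ' ') is left as a separate step.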