Command line method to find repeat-word typos, with line numbers


You need to take care of at least some edge cases:

Repeated words at the end (and beginning) of a line.
The search should be case insensitive, because of frequent errors like "The the apple".
You probably want to restrict the search to word constituents, so that it does not match something like ( ( a + b) + c ) (repeated opening parentheses).
Only full words should match, to rule out "the thesis".
For human language, Unicode characters inside words should be interpreted properly.

All in all, I recommend a pcregrep solution:

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

Here M lets a match span more than one line and i makes the search case insensitive. The color and the line numbers (the n option) are optional, but usually nice to have.

Install

On Debian-based distributions you can install via:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to see:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is just a text capture. On a terminal that supports color, the matches are highlighted:

has has
and
and
be be

You should take a peek at the venerable diction(1) and style(1) commands. They catch a variety of boo-boos. There are newish versions (GPLv3 here on Fedora 23).

Install

For example on Debian-based distributions, install the package diction, which includes style:

$ sudo apt-get install diction

At least in Fedora it is:

$ dnf install diction

Red Hat Enterprise (and clones) probably need:

$ yum install diction

In any case, this comes from an upstream GNU package called diction, so it should be called the same almost everywhere.

Example

$ diction jefferson_typo.txt
jefferson_typo.txt:1: He has [has] refused his Assent to Laws, the [most] wholesome and necessary for the public good.

jefferson_typo.txt:3: He has forbidden his Governors to pass Laws of immediate and [and] pressing importance, unless suspended in their operation till his Assent should be [be] obtained; and when [so] suspended, he has utterly neglected to attend to them.

2 phrases in 2 sentences found.

Pros

catches the repeated words, amongst other things

Cons

introduces [] markings for items unrelated to repeated words. For example, [so] is probably marked because it can be considered extraneous per The Elements of Style by Strunk. See man diction.
the number shown is not always the line number in the original input; it is the line on which the sentence starts. For example, [be] is on line 5 of the input, but it is reported under line 3 because it is part of the sentence that begins on line 3. So this is slightly different from what you wanted.

This will print lines (with filename and line number) with repeated words:

for f in *.txt; do
    perl -ne 'print "$ARGV: $.: $_" if /\b(\w+)\W+\1\b/' "$f"
done

For multi-line matching there's this, but you lose the line numbers because it's slurping in the file by paragraphs (that's the effect of the -00 option). The \W+ between the two words means any "non-word" characters, including newlines.

perl -00 -nE '
    @matches = /\b((\w+)\W+\2\b)/g;
    while (@matches) {
        ($match, $word) = splice @matches, 0, 2;
        say "dup: $match";
    }
' jefferson_typo.txt

dup: has has
dup: and
and
dup: be be
