Command line method to find repeat-word typos, with line numbers
Edited: added install and demo
You need to take care of at least some edge cases:
- Repeated words at the end (and beginning) of a line.
- The search should be case-insensitive, because of frequent errors like
The the apple
- You probably want to restrict the search to word constituents, so as not to match something like
( ( a + b) + c )
(repeated opening parentheses).
- Only full words should match, to eliminate false positives like
the thesis
- When it comes to human language, Unicode characters inside words should be interpreted properly.
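Most of these edge cases can also be handled by comparing each word with the previous one instead of using a single regex. The following is only a minimal sketch in portable awk, not a replacement for the regex approach: the helper name find_repeats is hypothetical, words are assumed to be plain ASCII letters and digits (so the Unicode point is not addressed), and the line reported is the one holding the second occurrence.

```shell
# find_repeats: hypothetical helper; reads the named files, or stdin if none.
# Prints "LINE: word" for every word that repeats the word just before it.
find_repeats() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            word = tolower($i)            # case-insensitive comparison
            # Only pure letter/digit tokens count as words, so "(" is ignored.
            if (word ~ /^[A-Za-z0-9]+$/ && word == prev)
                print FNR ": " $i
            prev = word                   # kept across lines, so line-end repeats are seen
        }
    }' "$@"
}
```

Calling `find_repeats jefferson_typo.txt` prints one line per repeat. A word followed by punctuation (for example "be, be") is not matched by this sketch.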
All in all, I recommend a pcregrep solution:
pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file
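Obviously the --color and line number (-n) options are optional, but usually nice to have. The -M option lets a match span more than one line, and -i makes the search case-insensitive.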
Install
On Debian-based distributions you can install via:
$ sudo apt-get install pcregrep
Example
Run the command on jefferson_typo.txt to see:
$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly
The above is just a text capture, but on a color-supported terminal, matches are colorized:
- has has
- and
- and
- be be
You should take a peek at the venerable diction(1) and style(1) commands. They catch a variety of boo-boos. There are newish versions (GPLv3 here on Fedora 23).
Install
For example, on Debian-based distributions, install the package diction, which includes style:
$ sudo apt-get install diction
At least in Fedora it is:
$ dnf install diction
Red Hat Enterprise (and clones) probably need:
$ yum install diction
In any case, this comes from an upstream GNU package called diction, so it should be called the same almost everywhere.
Example
$ diction jefferson_typo.txt
jefferson_typo.txt:1: He has [has] refused his Assent to Laws, the [most] wholesome and necessary for the public good.
jefferson_typo.txt:3: He has forbidden his Governors to pass Laws of immediate and [and] pressing importance, unless suspended in their operation till his Assent should be [be] obtained; and when [so] suspended, he has utterly neglected to attend to them.
2 phrases in 2 sentences found.
Pros
- catches the repeated words, amongst other things
Cons
- introduces [] markings for items not related to repeated words. For example, [so] is probably marked because it can be considered extraneous per The Elements of Style by Strunk. See man diction.
- the number shown is not always the original input's line number, but the number of the line on which the sentence starts. For example, [be] is on line 5 of the original input, but here it shows 3 only because [be] is part of the sentence beginning on line 3. So this is slightly different from what you wanted.
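The unrelated markings can be filtered out after the fact. The following is a minimal sketch, assuming diction always flags a repeated word by bracketing the second copy right after the first, as in the output shown above (has [has]). The helper name only_repeats is hypothetical, and the number printed is still the sentence's starting line, so the second drawback remains.

```shell
# only_repeats: hypothetical filter for diction output read from stdin.
# Keeps the "file:line:" prefix and prints only pairs of the form "word [word]".
only_repeats() {
    awk '{
        for (i = 2; i <= NF; i++)
            # A repeat looks like: word [word]; other marks such as [most] are skipped.
            if (tolower($i) == "[" tolower($(i - 1)) "]")
                print $1 " " $(i - 1) " " $i
    }'
}
```

It would be used as `diction jefferson_typo.txt | only_repeats`.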
This will print lines (with filename and line number) with repeated words:
# One perl run per file, so $. (the line number) restarts at 1 for each file
for f in *.txt; do
    perl -ne 'print "$ARGV: $.: $_" if /\b(\w+)\W+\1\b/i' "$f"
done
For multi-line matching there's this, but you lose the line numbers because it slurps in the file by paragraphs (that's the effect of the -00 option). The \W+ between the two words means any "non-word" characters, including newlines.
perl -00 -nE '
    @matches = /\b((\w+)\W+\2\b)/gi;
    while (@matches) {
        ($match, $word) = splice @matches, 0, 2;
        say "dup: $match";
    }
' jefferson_typo.txt
dup: has has
dup: and
and
dup: be be