How to remove duplicate letters using sed?
Method #1
You can use this sed command to do it:
$ sed 's/\([A-Za-z]\)\1\+/\1/g' file.txt
Example
Using your sample input above, I created a file, sample.txt.
$ sed 's/\([A-Za-z]\)\1\+/\1/g' sample.txt
NAME
nice - run a program with modified scheduling priority
SYNOPSIS
nice [-n adjustment] [-adjustment] [--adjustment=adjustment] [command [a$
Method #2
There is also this method, which collapses any doubled character, not just letters (note how -- becomes - in the output below):
$ sed 's/\(.\)\1/\1/g' file.txt
Example
$ sed 's/\(.\)\1/\1/g' sample.txt
NAME
nice - run a program with modified scheduling priority
SYNOPSIS
nice [-n adjustment] [-adjustment] [-adjustment=adjustment] [command [a$
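The difference between methods #1 and #2 is easier to see on a short string. This is only a sketch: the sample text is made up, and the variable names m1 and m2 are just labels for this illustration.

```shell
# Made-up sample: a tripled letter and a double dash.
sample='AAA --option'

# Method #1: a letter followed by one or more copies of itself (GNU sed \+).
m1=$(printf '%s\n' "$sample" | sed 's/\([A-Za-z]\)\1\+/\1/g')

# Method #2: any character followed by exactly one copy of itself.
m2=$(printf '%s\n' "$sample" | sed 's/\(.\)\1/\1/g')

printf '%s\n' "$m1"   # A --option
printf '%s\n' "$m2"   # AA -option
```

Method #1 collapses the whole run of letters and leaves the dashes alone; method #2 works pair by pair, so a run of three becomes two, and punctuation such as -- is affected as well.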
Method #3 (just the upper case)
The OP asked whether this could be modified so that only duplicated upper case characters are removed. Here's how, using a modified method #1.
Example
$ sed 's/\([A-Z]\)\1\+/\1/g' sample.txt
NAME
nice - run a program with modified scheduling priority
SYNOPSIS
nice [-n adjustment] [-adjustment] [--adjustment=adjustment] [command [a$
Details of the above methods
All the examples make use of a technique where, when a character in the set A-Z or a-z is first encountered, its value is saved. Wrapping parens around part of a pattern tells sed to save whatever that part matched for later. That value is stored in a temporary variable that you can access either immediately or later on. These variables are named \1, \2, and so on, one per pair of parens.
So the trick we're using is we match the first letter.
\([A-Za-z]\)
Then we turn around and use the value that we just saved as a secondary character that must occur right after the first one above, hence:
\([A-Za-z]\)\1
In sed we're also making use of the search and replace facility, s/../../g. The g means we're doing it globally, i.e. for every match on a line rather than only the first.
So when we encounter a character followed by one or more copies of itself, we substitute the whole run with just one of that character.
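To see the back-reference at work, here is a small sketch run against a made-up string rather than the real man page. The \+ quantifier is a GNU sed extension; the second command uses the POSIX interval \{1,\} instead, on the assumption that a stricter sed would need it.

```shell
# Made-up sample line with every letter doubled.
line='NNAAMMEE'

# GNU sed: a letter, then one or more copies of that same letter (\1\+).
gnu=$(printf '%s\n' "$line" | sed 's/\([A-Za-z]\)\1\+/\1/g')

# Same idea with the POSIX interval \{1,\} in place of \+.
posix=$(printf '%s\n' "$line" | sed 's/\([A-Za-z]\)\1\{1,\}/\1/g')

printf '%s\n%s\n' "$gnu" "$posix"   # NAME, twice
```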
This command removes all double letters:
sed 's/\([[:alpha:]]\)\1/\1/g'
\1 stands for the text inside \(…\), so this command means: wherever there's an alphabetical character followed by itself, replace by that alphabetical character alone.
That will transform e.g. command into comand. It would be better to restrict the transformation to where it's needed: non-indented lines.
sed '/^[[:alpha:]]/ s/\([[:alpha:]]\)\1/\1/g'
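As a quick check of what the address buys you, the sketch below runs both forms over two made-up lines: a doubled heading and an indented body line. The variable names are only for illustration.

```shell
# Two made-up lines: a doubled heading and an indented body line.
text=$(printf 'NNAAMMEE\n       run a command\n')

# Unrestricted: the indented line is damaged too.
all=$(printf '%s\n' "$text" | sed 's/\([[:alpha:]]\)\1/\1/g')

# Restricted to lines that start with a letter: only the heading changes.
restricted=$(printf '%s\n' "$text" | sed '/^[[:alpha:]]/ s/\([[:alpha:]]\)\1/\1/g')

printf '%s\n---\n%s\n' "$all" "$restricted"
```

The first result contains "run a comand", while the second keeps "run a command" intact and still turns the heading into NAME.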
This text is a man page rendered for terminals where bold is represented by overstrike: C\bC is rendered as a bold C, where \b is the backspace character (character number 8, also known as ^H). If the control characters are still there, forget about duplicate letters and instead remove the overstrike.
sed -e 's/.\b//g'
If you have a way to format the output, transform C\bC to bold and _\bC to underline.
sed -e 's/\(.\)\b\1/\e[1m\1\e[22m/g' -e 's/_\b\(.\)/\e[4m\1\e[24m/g' |
sed -e 's/\e\[22m\e\[1m//g' -e 's/\e\[24m\e\[4m//g'
If your sed doesn't understand these backslash escapes, use the literal characters (Ctrl+H for \b and Ctrl+[ for \e). GNU sed, for example, treats \b in a pattern as a word boundary rather than a backspace.
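One way to get the literal backspace into the script without typing control characters is to let printf produce it. This is a sketch under that assumption, run on a made-up overstruck string; the variable name bs is arbitrary.

```shell
# Literal backspace (character 8), produced by printf rather than typed.
bs=$(printf '\b')

# Made-up overstrike input: bold "NAME" (N\bN ...) and underlined "foo" (_\bf ...).
bold=$(printf 'N\bNA\bAM\bME\bE\n' | sed "s/.$bs//g")
under=$(printf '_\bf_\bo_\bo\n' | sed "s/.$bs//g")

printf '%s\n%s\n' "$bold" "$under"   # NAME, then foo
```

Because the shell expands $bs before sed sees the script, this does not depend on how a particular sed interprets \b.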
This is by no means a trivial task. A simple substitution for letter doubles would be disastrous. Think of what it would do to words like "attention" or "forgetting" or (more relevant to your case) "command". The script below is a naive first try at a solution. It makes use of a dictionary to determine which words really have duplicate letters.
#!/usr/bin/perl
use strict;
use warnings;

my $input_file = shift // die "No file name given\n";
my $dictionary = shift // '/usr/share/dict/words';
open my $if,   '<', $input_file or die "$input_file: $!\n";
open my $dict, '<', $dictionary or die "$dictionary: $!\n";

# Load the word list, lower-cased so it matches the lc() lookups below.
my %dictionary;
while (<$dict>) {
    chomp;
    $dictionary{lc $_}++;
}
close $dict;

LINE: while (<$if>) {
    chomp;
    WORD: for my $word (split /\s+/) {
        # Leave real words such as "command" untouched.
        print "$word " and next WORD if exists $dictionary{lc $word};
        # Collapse one doubled capital at a time until the word is known.
        SUBSTITUTION: while ($word =~ s{([A-Z])\1}{$1}) {
            exists $dictionary{lc $word} and last SUBSTITUTION;
        } #END SUBSTITUTION
        print "$word ";
    } #END WORD
    print "\n";
} #END LINE
close $if;
Call it like
[user@host]./myscript.pl input_file optional_dictionary_file >output_file
If you don't supply a second argument, the dictionary file defaults to /usr/share/dict/words, which should be available on most GNU/Linux systems.
Disclaimer: This is untested.
Caveats:
- It will break at least with hyphenated words (it uses spaces to decide what a "word" is).
- It will only remove duplicated capitals, to avoid messing with the contents of the man page itself.
- It will wreak havoc on hexadecimals like 0xFFFF.
- Probably many more that I can't see.