How to remove duplicate letters using sed?

Method #1

You can use this sed command to do it:

$ sed 's/\([A-Za-z]\)\1\+/\1/g' file.txt

Example

Using your above sample input I created a file, sample.txt.

$ sed 's/\([A-Za-z]\)\1\+/\1/g' sample.txt 
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [command [a$

Method #2

This variant collapses every doubled character, not only letters, so doubled spaces and punctuation are halved as well:

$ sed 's/\(.\)\1/\1/g' file.txt

Example

$ sed 's/\(.\)\1/\1/g' sample.txt 
NAME
    nice - run a program with modified scheduling priority

    SYNOPSIS
       nice   [-n  adjustment]  [-adjustment] [-adjustment=adjustment] [command [a$
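
To see why the spacing and the "--" change in the output above, here is a small sketch on a made-up line (not the OP's input; the variable names are just for illustration). It compares the any-character form with the letters-only form.

```shell
# Made-up line with doubled spaces and a doubled dash (not the OP's input).
line='nice  [--adjustment]'

# "." matches any character, so spaces and dashes are halved too.
any=$(printf '%s\n' "$line" | sed 's/\(.\)\1/\1/g')

# Restricting the class to letters leaves spacing and "--" alone.
letters=$(printf '%s\n' "$line" | sed 's/\([A-Za-z]\)\1/\1/g')

printf '%s\n%s\n' "$any" "$letters"
```

The first line printed should have single spaces and a single dash; the second should be identical to the input.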

Method #3 (just the upper case)

The OP asked whether this could be restricted so that only doubled upper-case letters are collapsed. Here is how, using a modified Method #1.

Example

$ sed 's/\([A-Z]\)\1\+/\1/g' sample.txt 
NAME
       nice - run a program with modified scheduling priority

       SYNOPSIS
              nice     [-n    adjustment]    [-adjustment] [--adjustment=adjustment] [command [a$

Details of the above methods

All the examples rely on capturing. When sed first meets a character in the set A-Z or a-z (or any character at all, in Method #2), it saves that character. Wrapping part of a pattern in escaped parentheses, \(...\), tells sed to save whatever that part matched. The saved text is then available through back-references named \1, \2, and so on up to \9, both later in the same pattern and in the replacement.

So the trick we're using is we match the first letter.

\([A-Za-z]\)

Then we turn around and use the value that we just saved as a secondary character that must occur right after the first one above, hence:

\([A-Za-z]\)\1

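In sed we're also making use of the search and replace facility, s/.../.../g. The g flag applies the substitution to every match on a line, not only the first.

So when we encounter a letter followed by the same letter, we replace the pair with a single copy of that letter. In Methods #1 and #3 the trailing \+ (a GNU extension) lets \1 repeat, so a run of three or more identical letters also collapses to one.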
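
As a quick check of the explanation above, this sketch runs a made-up line through both the pair form and the run form. The sample text and variable names are assumptions, not the OP's file, and \{1,\} is used as the POSIX spelling of GNU's \+.

```shell
# Hypothetical sample: each heading letter doubled, plus one tripled letter.
sample='NNAAMMEE AAA'

# \(X\)\1 matches exactly two identical letters, so "AAA" becomes "AA".
pairs=$(printf '%s\n' "$sample" | sed 's/\([A-Za-z]\)\1/\1/g')

# \(X\)\1\{1,\} matches a run of two or more, so "AAA" becomes "A".
runs=$(printf '%s\n' "$sample" | sed 's/\([A-Za-z]\)\1\{1,\}/\1/g')

printf '%s\n%s\n' "$pairs" "$runs"
```

The difference only shows up on runs longer than two; for cleanly doubled headings both forms give the same result.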

This command removes all double letters:

sed 's/\([[:alpha:]]\)\1/\1/g'

\1 stands for the text inside \(…\), so this command means: wherever there's an alphabetical character followed by itself, replace by that alphabetical character alone.

That will transform e.g. command into comand. It would be better to restrict the transformation to where it's needed: non-indented lines.

sed '/^[[:alpha:]]/ s/\([[:alpha:]]\)\1/\1/g'
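
A minimal sketch of the address restriction, using a hypothetical two-line excerpt (a doubled heading and an indented body line). It suggests that the indented line, including the "mm" in command, is left untouched.

```shell
# Hypothetical excerpt: a doubled heading and an indented body line.
page=$(printf '%s\n' 'SSYYNNOOPPSSIISS' '       nice [command]')

# The /^[[:alpha:]]/ address limits the substitution to lines
# that start with a letter, i.e. the non-indented headings.
fixed=$(printf '%s\n' "$page" | sed '/^[[:alpha:]]/ s/\([[:alpha:]]\)\1/\1/g')

printf '%s\n' "$fixed"
```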

This text is a man page rendered for terminals where bold is represented by overstrike: C\bC is rendered as bold, where \b is the backspace character (character number 8, also known as ^H). If the control characters are still there, forget about duplicate letters and instead remove the overstrike.

sed -e 's/.\b//g'
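
Here is a sketch of the same idea that does not depend on how sed interprets \b: the backspace is produced by printf and spliced into the sed script through a shell variable. The variable names and the sample heading are made up for illustration.

```shell
# Literal backspace, built with printf so no sed escape support is needed.
bs=$(printf '\b')

# Hypothetical overstruck heading: N\bN A\bA M\bM E\bE.
bold=$(printf 'N\bNA\bAM\bME\bE')

# Delete each "any character + backspace" pair, keeping the final glyph.
plain=$(printf '%s\n' "$bold" | sed "s/.${bs}//g")

printf '%s\n' "$plain"
```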

If you have a way to format the output, transform C\bC to bold and _\bC to underline.

sed -e 's/\(.\)\b\1/\e[1m\1\e[22m/g' -e 's/_\b\(.\)/\e[4m\1\e[24m/g' |
sed -e 's/\e\[22m\e\[1m//g' -e 's/\e\[24m\e\[4m//g'

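Not every sed understands these backslash escapes. GNU sed in particular treats \b in a pattern as a word boundary rather than a backspace; there, \x08 (backspace) and \x1b (escape) work instead. Failing that, use the literal characters (Ctrl+H for \b and Ctrl+[ for \e; in most shells, press Ctrl+V first to insert them).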
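
A sketch of the literal-character route, assuming a POSIX shell: printf builds the backspace and escape characters, and the shell variables bs and esc (names chosen here for illustration) splice them into the sed scripts. The input is made up.

```shell
bs=$(printf '\b')      # literal backspace (Ctrl+H)
esc=$(printf '\033')   # literal escape (Ctrl+[)

# Made-up input: "NA" overstruck as bold, "x" overstruck as underline.
raw=$(printf 'N\bNA\bA _\bx')

# First sed wraps each glyph in start/stop sequences; the second merges
# adjacent runs. "[" is literal in a replacement but needs \[ in a pattern.
styled=$(printf '%s\n' "$raw" |
  sed -e "s/\(.\)${bs}\1/${esc}[1m\1${esc}[22m/g" \
      -e "s/_${bs}\(.\)/${esc}[4m\1${esc}[24m/g" |
  sed -e "s/${esc}\[22m${esc}\[1m//g" -e "s/${esc}\[24m${esc}\[4m//g")

printf '%s\n' "$styled"
```

On a terminal that honours these sequences, NA should appear bold and x underlined.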

This is by no means a trivial task. A blanket substitution for doubled letters would be disastrous: think of what it would do to words like "attention" or "forgetting" or (more relevant to your case) "command". The script below is a naive first attempt at a solution. It uses a dictionary to decide which words legitimately contain doubled letters.

#!/usr/bin/perl

use strict;
use warnings;

my $input_file = shift // die "No file name given\n";
my $dictionary = shift // '/usr/share/dict/words';
open my $if, '<', $input_file or die "$input_file: $!\n";
open my $dict, '<', $dictionary or die "$dictionary: $!\n";

# Load the word list, lower-cased to match the lc() lookups below.
my %dictionary;
while (my $entry = <$dict>) {
    chomp $entry;
    $dictionary{lc $entry}++;
}
close $dict;    # close the filehandle, not the file name

LINE: while(<$if>){
    chomp;

    WORD: for my $word ( split /\s+/ ){
            print "$word " and next WORD if exists $dictionary{lc $word};

            # No /i flag, so only doubled capitals are collapsed.
            SUBSTITUTION: while ($word =~ s{([A-Z])\1}{$1}) {
                exists $dictionary{lc $word} and last SUBSTITUTION;
            } #END SUBSTITUTION
            print "$word ";

     } #END WORD

     print "\n";

} #END LINE

Call it like

$ ./myscript.pl input_file optional_dictionary_file > output_file

If you don't supply a second argument, the dictionary file defaults to /usr/share/dict/words, which should be available on most GNU/Linux systems.

Disclaimer: This is untested.

Caveats:

It will break at least with hyphenated words (it uses spaces to decide what a "word" is).
It will only remove duplicated capitals to avoid messing with the contents of the man page themselves.
It will wreak havoc on hexadecimals like 0xFFFF.
Probably many more that I can't see.
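
The hexadecimal caveat is easy to reproduce. The sketch below uses the upper-case-only sed form as a stand-in for the script's substitution loop (an assumption made for brevity; it is not the Perl script itself), on the value from the list above.

```shell
# Stand-in for the script's loop: collapse runs of identical capitals.
hex=$(printf '%s\n' '0xFFFF' | sed 's/\([A-Z]\)\1\{1,\}/\1/g')

printf '%s\n' "$hex"
```

Since 0xFFFF is not a dictionary word, nothing stops the collapse until a single F is left.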
