How to find the non-repeated letters in a given string

uniq only works on adjacent duplicates - so if you want to use that, you'd need to sort your input first, for example:

fold -w1 | sort | uniq -u | paste -sd ''

fold -w1 does the same as your sed 's/./&\n/g', but without introducing a spurious extra newline
sort makes duplicate characters adjacent
uniq -u prints only the singletons; the -u is what drops every character that occurs more than once
paste -sd '' joins the result back into a single line

Because of the sorting, you will not get your desired output order in all cases. For example:

$ echo 'AAAbefhMThkkD' | fold -w1 | sort | uniq -u | paste -sd ''
DMTbef
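
If the original order matters, one option is to keep fold -w1 but count the characters instead of sorting them. A minimal sketch, assuming single-byte (ASCII) input; the function name singletons is made up for illustration, and all input lines are treated as one string:

```shell
# Hypothetical helper: print, in input order, the characters that occur exactly once.
singletons() {
  fold -w1 |
    awk '
      { count[$0]++; seen[NR] = $0 }   # tally each character and remember its position
      END {
        for (i = 1; i <= NR; i++)
          if (count[seen[i]] == 1) printf "%s", seen[i]
        print ""
      }'
}

echo 'AAAbefhMThkkD' | singletons   # prints befMTD
```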

If you don't want to roll your own solution, you could always use Perl's List::MoreUtils module:

$ echo 'AAAbefhMThkkD' |
    perl -MList::MoreUtils=singleton -ne 'print singleton split //'
befMTD

With awk, you could do something like:

awk '
{
  n=split($0, a, "")
  for(i=1; i<=n; i++){
    if(gsub(a[i], "") == 1){ printf("%s", a[i]) }
  }
  print ""
}'

n=split($0, a, ""): a[1] becomes the 1st character of the string, a[2] the 2nd, etc.; n is the total number of characters.
for(i=1; i<=n; i++): loop over every element of the array a.
if(gsub(a[i], "") == 1): delete every occurrence of a[i] from the string; if exactly one character was deleted, printf("%s", a[i]) prints that character. Note that gsub treats a[i] as a regular expression, so a metacharacter such as . is not matched literally.
print "": prints a newline after the whole line has been processed. This is optional if you have a single input line.

Example with condensed one-liner:

$ awk '{n=split($0,a,"");for(i=1;i<=n;i++)if(gsub(a[i],"")==1)printf("%s",a[i])}' <<< AAAbefhMThkkD
befMTD

Note: Splitting on a null string is not defined by POSIX. However, gawk (the GNU Awk), mawk and original-awk all implement the operation as desired.
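
For a variant that stays within POSIX awk (no split on a null string) and compares characters literally rather than as regular expressions, here is a sketch built on substr() and index(). The function name nonrepeated is hypothetical, and on multi-byte input the result depends on whether your awk counts characters or bytes:

```shell
# Hypothetical helper: for each input line, keep only the characters that occur once.
nonrepeated() {
  awk '{
    n = length($0); out = ""
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      # keep c if position i is its first occurrence and it does not occur again
      if (index($0, c) == i && index(substr($0, i + 1), c) == 0) out = out c
    }
    print out
  }'
}

echo 'AAAbefhMThkkD' | nonrepeated   # prints befMTD
echo 'a.b.a' | nonrepeated           # prints b
```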

With sed, you could do something like:

sed '
  :1
  /\(.*\(.\).*\)\2/ { # while there is a duplicated char
    s//\2\1/; # move it to the front
    :2
      # remove characters that are the same as the first in a loop:
      s/^\(\(.\).*\)\2/\1/
    t2
    s/^.//
    b1
  }'

With the GNU implementation of sed, you can shorten it to:

sed -E ':1;s/(.*(.).*)\2/\2\1/;T;:2;s/^((.).*)\2/\1/;t2;s/^.//;t1'
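
To try this on the sample input used earlier, the portable script can be wrapped in a shell function. This is only a usage sketch; the name nonrepeated_sed is hypothetical:

```shell
# Hypothetical wrapper around the portable sed script:
# repeatedly pick a duplicated character, move one copy to the front,
# delete the other copies, then delete the front copy too.
nonrepeated_sed() {
  sed '
    :1
    /\(.*\(.\).*\)\2/ {
      s//\2\1/
      :2
      s/^\(\(.\).*\)\2/\1/
      t2
      s/^.//
      b1
    }'
}

echo 'AAAbefhMThkkD' | nonrepeated_sed   # prints befMTD
```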

If you want to do the check for duplicates case insensitively (for áÁbBcδΔ to become c for instance), you can add the i flag to the first 2 s commands in the GNU sed code above. Note however that it won't work for things like German ß vs SS.
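
Concretely, the case-insensitive variant would look like the following sketch. It assumes GNU sed (for the T command and the i flag on s); the function name nonrepeated_ci is hypothetical:

```shell
# Hypothetical wrapper: GNU sed one-liner with the i flag on the first two s commands.
nonrepeated_ci() {
  sed -E ':1;s/(.*(.).*)\2/\2\1/i;T;:2;s/^((.).*)\2/\1/i;t2;s/^.//;t1'
}

echo 'aAbBc' | nonrepeated_ci   # prints c
```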

And that would still not handle Unicode equivalence, and it works at the character (not grapheme cluster) level. For instance, if you have aéá where the accented letters are expressed in their decomposed form, not only would a U+00E9 é not be considered the same as a U+0065 U+0301 é, but aéá expressed as U+0061 U+0065 U+0301 U+0061 U+0301 would become e (U+0065), the only non-duplicated character in there, even though those 5 characters actually form 3 distinct grapheme clusters. My first name in its decomposed form would become St́phan (with the combining acute accent landing on the t when both es are removed).

Using:

perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}//gi while m/(\X)\X*\1/i'

(here extending @sitaram's answer: -Mopen=locale treats input as characters instead of bytes, \X instead of . matches a grapheme cluster instead of a character, and \b{g} matches a grapheme cluster boundary) would address some of those issues (not breaking grapheme clusters in the middle, ß vs SS), but not Unicode equivalence:

$ echo $'groß KUSS. Ste\u0301phane, \ue9' |  perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}//gi while m/(\X)\X*\1/i'
groKU.Stéphane,é

(ß spotted as duplicate of SS, the e in e\u0301 not associated with the standalone e, but the two variants of é not recognised as the same).

Also note that ß/SS would be turned into / as ß is processed first, while SS/ß would be turned into /ß as the S is processed first.

It would also turn ßA/SAS into / as removing the duplicate As would reveal a SS, the uppercase version of ß. To avoid that, you could change it to:

perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}/\n/gi while m/((?!\n)\X)\X*\1/i; s/\n//g'

That is, instead of removing the duplicate grapheme clusters, we change them to newlines, which prevents the characters on either side from being joined into a sequence that could be the uppercase or lowercase variant of another grapheme cluster.
