How to find the non-repeated letters in a given string
uniq only works on adjacent duplicates, so if you want to use it, you need to sort your input first, for example:
fold -w1 | sort | uniq -u | paste -sd ''
fold -w1: does the same as your sed 's/./&\n/g', but without introducing an extra spurious newline
sort: makes duplicate characters adjacent
uniq -u: the -u is important to only print singletons
paste -sd '': joins the result back into a single line
Because of the sorting, the output will not always be in your desired order. For example:
$ echo 'AAAbefhMThkkD' | fold -w1 | sort | uniq -u | paste -sd ''
DMTbef
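To see why the sort step matters, here is a small illustration on three made-up input lines (not taken from the question): the two "a" lines are only recognised as duplicates once they are adjacent.

```shell
# Without sorting, the two "a" lines are not adjacent,
# so uniq -u treats every line as unique and prints all three.
printf 'a\nb\na\n' | uniq -u

# After sorting, the "a" lines are adjacent and get dropped;
# only the singleton "b" is printed.
printf 'a\nb\na\n' | sort | uniq -u
```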
If you don't want to roll your own solution, you could always use Perl's List::MoreUtils module:
$ echo 'AAAbefhMThkkD' |
perl -MList::MoreUtils=singleton -ne 'print singleton split //'
befMTD
awk '
{
    n=split($0, a, "")
    for(i=1; i<=n; i++){
        if(gsub(a[i], "") == 1){ printf("%s", a[i]) }
    }
    print ""
}'
n=split($0, a, ""): a[1] becomes the 1st character of the string, a[2] the 2nd, etc. n is the total number of characters.
for(i=1; i<=n; i++): loop over the whole array a.
if(gsub(a[i], "") == 1): delete all a[i] characters from the string. If only one character was deleted from the string,
printf("%s", a[i]): print that character.
print "": prints a newline character after the whole line has been processed. This is optional if you have a single input line.
Example with condensed one-liner:
$ awk '{n=split($0,a,"");for(i=1;i<=n;i++)if(gsub(a[i],"")==1)printf("%s",a[i])}' <<< AAAbefhMThkkD
befMTD
Note: Splitting on a null string is not defined by POSIX. However, gawk (the GNU Awk), mawk and original-awk all implement the operation as desired.
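If you would rather not depend on that behaviour, one possible variant (a sketch, not the code from the answer above) walks the line with substr() and counts characters in an array. It also compares characters literally instead of handing them to gsub() as regular expressions, which matters for input containing characters such as a dot. The function name singletons_awk is made up for the illustration.

```shell
# singletons_awk is a hypothetical helper name used only in this sketch.
singletons_awk() {
  awk '{
    n = length($0)
    split("", count)                 # clear counts left over from the previous line
    for (i = 1; i <= n; i++)         # first pass: count each character
      count[substr($0, i, 1)]++
    out = ""
    for (i = 1; i <= n; i++) {       # second pass: keep singletons, in input order
      c = substr($0, i, 1)
      if (count[c] == 1) out = out c
    }
    print out
  }'
}

echo 'AAAbefhMThkkD' | singletons_awk     # prints befMTD
printf '%s\n' 'a.b.c' | singletons_awk    # prints abc
```

Note that with awk implementations that are not multibyte-aware, length() and substr() operate on bytes, so this is only safe for single-byte characters there.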
With sed, you could do something like:
sed '
  :1
  /\(.*\(.\).*\)\2/ { # while there is a duplicated char
    s//\2\1/; # move it to the front
    :2
      # remove characters that are the same as the first in a loop:
      s/^\(\(.\).*\)\2/\1/
    t2
    s/^.//
    b1
  }'
With the GNU implementation of sed, you can shorten it to:
sed -E ':1;s/(.*(.).*)\2/\2\1/;T;:2;s/^((.).*)\2/\1/;t2;s/^.//;t1'
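As a quick check against the sample input used earlier on this page (assuming the sed found in PATH is GNU sed, since -E together with the T command is GNU-specific), the remaining characters come out in their original order:

```shell
# Assumes GNU sed (for the T command).
echo 'AAAbefhMThkkD' |
  sed -E ':1;s/(.*(.).*)\2/\2\1/;T;:2;s/^((.).*)\2/\1/;t2;s/^.//;t1'
# prints befMTD
```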
If you want to do the check for duplicates case insensitively (for áÁbBcδΔ to become c for instance), you can add the i flag to the first 2 s commands in the GNU sed code above. Note however that it won't work for things like German ß vs SS.
That would still not handle Unicode equivalence, and it works at the character (not grapheme cluster) level. For instance, if you have aéá where the accented letters are expressed in their decomposed form, not only would a U+00E9 é not be considered the same as a U+0065 U+0301 é, but aéá expressed as U+0061 U+0065 U+0301 U+0061 U+0301 would become e (U+0065), the only non-duplicated character in there, even though those 5 characters actually form 3 distinct grapheme clusters. My first name in its decomposed form would become St́phan (with the combining acute accent landing on the t when both es are removed).
Using:
perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}//gi while m/(\X)\X*\1/i'
here extending @sitaram's answer (using -Mopen=locale to treat input as characters instead of bytes, \X instead of . to match a grapheme cluster instead of a character, and \b{g} for a grapheme cluster boundary) would address some of those issues (not breaking down grapheme clusters in the middle, ß vs SS), but not the Unicode equivalence:
$ echo $'groß KUSS. Ste\u0301phane, \ue9' | perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}//gi while m/(\X)\X*\1/i'
groKU.Stéphane,é
(ß spotted as duplicate of SS, the e in e\u0301 not associated with the standalone e, but the two variants of é not recognised as the same).
Also note that ß/SS would be turned into / as ß is processed first, while SS/ß would be turned into /ß as the S is processed first.
It would also turn ßA/SAS into / as removing the duplicate As would reveal an SS, the uppercase version of ß. To avoid that, you could change it to:
perl -Mopen=locale -lpe 's/\b{g}\Q$1\E\b{g}/\n/gi while m/((?!\n)\X)\X*\1/i; s/\n//g'
That is, instead of removing the duplicate grapheme clusters, we change them to newlines, which prevents the characters on either side from being joined into a sequence of grapheme clusters that could be the uppercase or lowercase variant of another grapheme cluster.