Remove duplicate chars using regex?

In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this

 s="ababacbdefefbcdefde"

 while re.search(r'([a-z])(.*)\1', s):
     s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)

 print s  # prints 'abcdef'

>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'

The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.

Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.

On side note...

Your example code for just a is actually buggy:

>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'

You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".

Remove duplicate chars using regex?

On side note...

Tags:

Python

String

Regex

Related

Recent Posts