How to remove consecutive identical words from a string in Python
Short regex magic:
import re
mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', '\\1', mystring)
print(res)
Regex pattern details:
- \b - word boundary
- (\w+\s*) - one or more word chars \w+ followed by any number of whitespace characters \s*, enclosed in a capturing group (...)
- \1{1,} - refers to the 1st captured group, repeated one or more times {1,}
The output:
my friend's new and old cats are running in the street
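One caveat worth checking against your own data: because the group includes the optional trailing whitespace, the pattern above does not match a duplicated pair at the very end of the string ("hello hello" stays unchanged), and it can collapse a doubled letter at the start of a word ("aardvark" becomes "ardvark"). A minimal sketch of a stricter variant follows; the function name dedupe_words is hypothetical, and the pattern is one possible alternative rather than the only way to do it. It is still case-sensitive, so "The the" is left alone.

```python
import re

def dedupe_words(text):
    # (\w+) captures a single word; (?:\s+\1\b)+ then requires whitespace,
    # the same word again, and a word boundary, one or more times.
    # The whole run is replaced by the first occurrence of the word.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

mystring = "my friend's new new new new and old old cats are running running in the street"
print(dedupe_words(mystring))
print(dedupe_words("hello hello"))               # duplicate at the end of the string
print(dedupe_words("aardvark is in the theme"))  # no whole-word repeats here
```

Requiring \s+ between repeats and \b after each one means only whole words separated by whitespace count as duplicates.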
Using itertools.groupby:
>>> import itertools
>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"
- mystring.split() splits mystring into a list of words.
- itertools.groupby efficiently groups consecutive identical words; k is the key (the word itself) of each group.
- Using a generator expression, we just take the group key.
- We join using a space.
The complexity is linear in the size of the input string.
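The groupby steps can be wrapped in a small helper. This is a sketch under assumptions: the name dedupe_consecutive and the ignore_case flag are hypothetical additions, not part of the approach above. With ignore_case=True the words are grouped by their lowercased form and the first spelling in each run is kept.

```python
from itertools import groupby

def dedupe_consecutive(text, ignore_case=False):
    # Split on any whitespace; runs of spaces or tabs are discarded here.
    words = text.split()
    # key=None makes groupby compare the words themselves.
    key = str.lower if ignore_case else None
    # Each group is a run of consecutive equal words; keep its first item.
    return ' '.join(next(group) for _, group in groupby(words, key=key))

mystring = "my friend's new new new new and old old cats are running running in the street"
print(dedupe_consecutive(mystring))
print(dedupe_consecutive("The the  cat", ignore_case=True))
```

Because the string is split and re-joined, any run of whitespace in the input becomes a single space in the result. If the original spacing matters, the regex approach may be a better fit.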