Match and remove duplicated characters: Replace multiple (3+) non-consecutive occurrences
Non-regex R solution. Split the string, replace the elements of this vector having rowid >= 3 * with '-', and paste it back together.
x <- '111aabbccxccybbzaa1'
xsplit <- strsplit(x, '')[[1]]
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = '')
# [1] "11-aabbccx--y--z---"
* rowid(x) is an integer vector with each element representing the number of times the value from the corresponding element of x has been realized. So if the last element of x is 1, and that's the fourth time 1 has occurred in x, the last element of rowid(x) is 4.
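For readers who don't use data.table, the running count described above can be sketched in Python. This is only an illustration of the idea: the rowid function below is a hypothetical helper named after the R function, not part of any Python library.

```python
from collections import defaultdict

def rowid(values):
    """Hypothetical Python analogue of data.table::rowid: for each element,
    return how many times its value has occurred so far (1-based)."""
    seen = defaultdict(int)
    ids = []
    for v in values:
        seen[v] += 1
        ids.append(seen[v])
    return ids

x = '111aabbccxccybbzaa1'
ids = rowid(x)
print(ids[-1])  # 4, because the last '1' is the fourth '1' in x

# Replace every third-or-later occurrence with '-'
masked = ''.join('-' if n >= 3 else ch for ch, n in zip(x, ids))
print(masked)  # 11-aabbccx--y--z---
```

Because the count is built in a single left-to-right pass, the string is scanned once regardless of how many distinct characters it contains.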
You can easily accomplish this without regex:
s = '111aabbccxccybbzaa1'
for u in set(s):
    for i in [i for i in range(len(s)) if s[i]==u][2:]:
        s = s[:i]+'-'+s[i+1:]
print(s)
Result:
11-aabbccx--y--z---
How this works:
1. for u in set(s) gets the set of unique characters in the string: {'c','a','b','y','1','z','x'}
2. for i in ... loops over the indices gathered in step 3.
3. [i for i in range(len(s)) if s[i]==u][2:] loops over each index in the string and checks whether the character there matches u (from step 1), then slices the resulting list from index 2 to the end (dropping the first two elements if they exist).
4. Set the string to s[:i]+'-'+s[i+1:], which concatenates the substring up to the index, a '-', and the substring after the index, effectively replacing the original character with '-'.
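To make the steps above concrete, here is a small trace for a single character, 'c', chosen arbitrarily for illustration. The variable names positions and extra are not in the original answer; they just name the intermediate values.

```python
s = '111aabbccxccybbzaa1'
u = 'c'  # one of the unique characters from set(s)

# Every index at which u occurs
positions = [i for i in range(len(s)) if s[i] == u]
print(positions)  # [7, 8, 10, 11]

# Drop the first two occurrences; what remains must be replaced
extra = positions[2:]
print(extra)  # [10, 11]

# Replacing one character with one character keeps the length the same,
# so the indices collected earlier stay valid after each replacement
for i in extra:
    s = s[:i] + '-' + s[i+1:]
print(s)  # 111aabbccx--ybbzaa1
```

The full solution simply repeats this for every character in set(s).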
An option with gsubfn
library(gsubfn)
p <- proto(fun = function(this, x) if (count >=3) '-' else x)
for(i in c(0:9, letters)) x <- gsubfn(i, p, x)
x
#[1] "11-aabbccx--y--z---"
data
x <- '111aabbccxccybbzaa1'