Remove repeated elements in a string with R
As easy as:
gsub("(.{2,})\\1+","\\1",str, perl = T)
SOLUTION
The solution is to rely on the PCRE or ICU regex engines, rather than TRE.
Use either base R gsub
with perl=TRUE
(it uses PCRE regex engine) and "(?s)(.{2,})\\1+"
pattern, or a stringr::str_replace_all()
(it uses ICU regex engine) with the same pattern:
> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"
The (?s)
flag is necessary for .
to match any char including line break chars (in TRE regex, .
matches all chars by default).
DETAILS
TRE regex is not good at handling "pathological" cases that are mostly related to backtracking, which directly involves quantifiers (I bolded some parts):
The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.
Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by anyregexec()
call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.
In those cases when TRE has trouble calculating all possibilities of matching a string it does not return any match, the string is returned as is. Hence, there is no changes in the gsub
call.