How to find a last occurrence of set of characters in string using regex in java?
There are few ways to solve the problem and the best way will depend on the size of the input and the complexity of the pattern:
Reverse the input string and possibly the pattern, this might work for non-complex patterns. Unfortunately
java.util.regex
doesn't allow to to match the pattern from right to left.Instead of using a greedy quantifier simply match the pattern and loop
Matcher.find()
until last occurrence is found.Use a different regex engine with better performance e.g. RE2/J: linear time regular expression matching in Java.
If option 2 is not efficient enough for your case I'd suggest to try RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as
a|b
, the engine will try to match subpatterna
first, and if that yields no match, it will reset the input stream and try to matchb
instead.If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
Performance issues with the (?s).*(x|y|z)
regex come from the fact the .*
pattern is the first subpattern that grabs the whole string first, and then backtracking occurs to find x
, y
or z
. If there is no match, or the match is at the start of the string, and the strings is very large, this might take a really long time.
The ([xyz])(?=[^xyz]*$)
pattern seems a little bit better: it captures x
, y
or z
and asserts there is no other x
, y
or z
up to the end of the string, but it also is somewhat resource-consuming due to each lookahead check after a match is found.
The fastest regex to get your job done is
^(?:[^xyz]*+([xyz]))+
It matches
^
- start of string(?:[^xyz]*+([xyz]))+
- 1 or more repetitions of[^xyz]*+
- any 0 or more chars other thanx
,y
andz
matched possessively (no backtracking into the pattern is allowed)([xyz])
- Group 1:x
,y
orz
.
The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).