Is it possible to erase a capture group that has already matched, making it non-participating?
I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":
12. There are some differences that are concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to "b".
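For a quick way to see the PCRE2-style behaviour without a PCRE build to hand, here is a minimal sketch in Python. Python's re module is not PCRE, so treat this as an analogy rather than a statement about PCRE itself; as far as I can tell it keeps the stale capture in the same way:

```python
import re

# Group 2 matches "b" on the first pass through the repeated group and does
# not participate on the second pass (the trailing "a").
m = re.match(r"^(a(b)?)+$", "aba")

print(m.group(1))  # capture from the last iteration of group 1
print(m.group(2))  # capture left over from the first iteration
```

Here group 2 still reports "b" after the match, even though it took no part in the final iteration. That persistence is exactly what I would like to be able to undo.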
I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:
Suppose you have a simple task well-suited to being solved with forward references; for example, checking whether the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:
/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/
Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:
\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z
Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:
1. Enclose the entire subexpression in (?!(?! )):
\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z
Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.
2. Branch reset group to reset the value of capture groups:
\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z
With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.
3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:
\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z
The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).
This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.
4. This doesn't really answer the question, but it solves the problem:
\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))
This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.
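The shape of that rewrite ("there is no line that fails the per-line test") can be sketched in Python. Python's re module, as far as I know, rejects the self-referencing group that the palindrome pattern needs, so this sketch swaps in a much simpler per-line test (digits only) as a stand-in. The names ALL_LINES_DIGITS and every_line_is_digits are made up for illustration:

```python
import re

# Stand-in predicate (an assumption for this sketch): a line is valid if it
# consists only of digits. The shape mirrors the pattern above:
#   \A(?![\s\S]*^(?!VALID_LINE$))
# i.e. "there is no line start from which VALID_LINE$ fails to match".
ALL_LINES_DIGITS = re.compile(r"\A(?![\s\S]*^(?!\d*$))", re.MULTILINE)


def every_line_is_digits(text: str) -> bool:
    """Return True when no line of `text` fails the digits-only test."""
    return ALL_LINES_DIGITS.match(text) is not None


print(every_line_is_digits("12\n34"))
print(every_line_is_digits("12\nab"))
```

Because each line is tested inside its own lookahead, nothing captured while checking one line is visible when the next line is checked, which is why the stale \2 problem never comes up with this form.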
With PCRE (and every other flavour, as far as I'm aware) it's not possible to unset a capturing group. However, because subroutine calls do not retain capture values set during a previous call, you can use them to accomplish the same task:
(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}
If you are going to implement a way to unset a capturing group in your own regex flavour, I'd strongly suggest not letting it happen automatically. Provide a flag for it instead.
This is partially possible in .NET's flavour of regex.
The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.
To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).
- First, name2 must have previously been captured.
- The subexpression must then match.
- If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
- The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
- The match is advanced to the end of the subexpression.
If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once, since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)