How to get the longest bracket pairs from a string
First Case
str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";
StringCases[str,
RegularExpression["(?P<a><([^<>]|(?P>a))*>)"]
]
(* {"<aa <bbb> <bbb> aa>", "<<dfa>a>"} *)
This works as follows:
- (?P<a> ...) gives the name a to the pattern <([^<>]|(?P>a))*>.
- The string or substring matching this pattern must start with < and end with >.
- Between these two characters, the subpattern ([^<>]|(?P>a)) can be repeated 0 or more times.
- The first alternative of this subpattern, [^<>], matches any single character other than < or >. If a < is met while reading the string, the second alternative (?P>a) calls the pattern a recursively, and we start again at the second bullet with the substring starting at this character.
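To see what the recursive call contributes, it can help to compare with the same pattern stripped of (?P>a). The snippet below is only an illustrative sketch and is not part of the original answer; it reuses the test string from above.

```mathematica
str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";

(* Without the recursive alternative, a "<" can never appear between
   the delimiters, so only the innermost pairs are found. *)
StringCases[str, RegularExpression["<[^<>]*>"]]
(* {"<bbb>", "<bbb>", "<dfa>"} *)
```

With the recursive alternative restored, each inner pair is consumed as part of the enclosing match, which is why the full pattern returns the outermost pairs instead.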
Second Case
str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))"
StringCases[str2,
RegularExpression["(?P<a>(\\[|\\()([^\\[\\]\\(\\)]|(?P>a))*(\\]|\\)))"]
]
(* {"[ab*[c]d]", "(b(x99))"} *)
This works as above. Here, instead of < at the beginning of the (sub)string, we allow for [ or ( with (\\[|\\(). The other modifications are in line with this change.
Note that this regular expression may not be satisfactory for cases with mismatched delimiters, such as
str3 = "dd9[ab*[c]d)esiddx(45x(b(x99))";
(* The square bracket after d is replaced by a parenthesis. *)
StringCases[str3,
RegularExpression["(?P<a>(\\[|\\()([^\\[\\]\\(\\)]|(?P>a))*(\\]|\\)))"]
]
(* {"[ab*[c]d)", "(b(x99))"} *)
The first element starts with a [ and ends with ). This can be avoided by adding a named group and a conditional test on that group:
StringCases[str3,
RegularExpression["(?P<a>((?P<b>\\[)|\\()([^\\[\\]\\(\\)]|(?P>a))*(?(b)\\]|\\)))"]
]
(* {"[c]", "(b(x99))"} *)
The starting [ is referred to as b. The pattern (?(b)\\]|\\)) tells us that if b had a match, then the character to match should be ], or otherwise ).
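The conditional can be checked in isolation on a deliberately simplified pattern. The sketch below is an illustration rather than part of the original answer: the name cond and the four test strings are made up, and the pattern only allows the single letter x between the delimiters.

```mathematica
(* An opener ("[" captured as b, or "("), the letter x, then the closer
   selected by the conditional: "]" if b matched, ")" otherwise. *)
cond = RegularExpression["((?P<b>\\[)|\\()x(?(b)\\]|\\))"];

StringMatchQ[{"[x]", "(x)", "[x)", "(x]"}, cond]
(* {True, True, False, False} *)
```

The two mixed cases are rejected because the closing character is decided by whether the group b took part in the match.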
This works:
str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";
StringCases[str, "<" ~~ Shortest@s___ ~~ ">" /; StringCount[s, "<"] == StringCount[s, ">"]]
{"<aa <bbb> <bbb> aa>", "<<dfa>a>"}
Or equivalently
StringCases[str,
s : RegularExpression["<.*?>"] /; StringCount[s, "<"] == StringCount[s, ">"]]
{"<aa <bbb> <bbb> aa>", "<<dfa>a>"}
Of course it isn't a pure regex approach: the method uses Condition. A similar approach is used in this answer of mine, which gives an extended explanation of how Condition works together with the lazy quantifier Shortest (or *? in regex).
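To make the role of Condition visible, here is the same lazy pattern with the condition removed. This is a small sketch added for illustration, using the test string from above.

```mathematica
str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";

(* Lazy match only: each "<" is paired with the nearest following ">" *)
StringCases[str, "<" ~~ Shortest@s___ ~~ ">"]
(* {"<aa <bbb>", "<bbb>", "<<dfa>"} *)

(* The first candidate is unbalanced, so the condition rejects it *)
{StringCount["aa <bbb", "<"], StringCount["aa <bbb", ">"]}
(* {1, 0} *)
```

When the condition fails for a candidate, the matcher appears to try the next longer candidate for s, up to the following ">", until the counts agree. That is how the version with Condition arrives at "<aa <bbb> <bbb> aa>".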
The second problem can be solved using two patterns of the same type as alternatives:
Clear[balanced]
balanced[{l_, r_}] :=
HoldPattern[(left : l ~~ Shortest@s___ ~~ right : r) /;
StringCount[s, left] == StringCount[s, right]]
str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";
StringCases[str2, balanced /@ {{"[", "]"}, {"(", ")"}}]
{"[ab*[c]d]", "(b(x99))"}
Or we can combine them into a single pattern as follows:
StringCases[str2, (left : "[" | "(" ~~ Shortest@s___ ~~ right : "]" | ")") /;
MatchQ[{left, right}, {"[", "]"} | {"(", ")"}] &&
StringCount[s, left] == StringCount[s, right]]
{"[ab*[c]d]", "(b(x99))"}
This is not a regular expression, but counting the left and right delimiters and finding the positions where the running counts are equal gives the top-level bracket pairs:
str1 = "xx<aa<bbb> <bbb>aa>yy<<dfa>a>";
str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";
f[l_, r_, str_] := Module[{sum, pos},
sum = Accumulate[StringCases[str, l | r] /. {l -> 1, r -> -1}];
pos = First /@ StringPosition[str, (l | r)];
Partition[(First /@
SplitBy[Transpose[{sum, pos}], #[[1]] == 0 &])[[All, 2]], 2]
];
Works for strings with complete pairs:
f["<", ">", str1]
f["[", "]", str2]
{{3, 19}, {22, 29}}
{{4, 12}}
But it does not work for e.g. f["(", ")", str2], because str2 has one more opening ( than closing ).
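One possible way around this limitation is to pair delimiters with an explicit stack, so that an unmatched opener is simply left over instead of shifting the running sum. The following is a minimal sketch under stated assumptions, not the original author's method: topLevelPairs is a hypothetical helper name, and it assumes single-character delimiters.

```mathematica
(* Hypothetical helper: returns {start, end} positions of top-level pairs.
   Assumes l and r are single characters; unmatched delimiters are ignored. *)
topLevelPairs[l_String, r_String, str_String] :=
 Module[{chars = Characters[str], stack = {}, pairs = {}, open},
  Do[
   Which[
    (* opener: remember its position *)
    chars[[i]] === l,
    AppendTo[stack, i],
    (* closer with a pending opener: pop it and record the pair *)
    chars[[i]] === r && stack =!= {},
    open = Last[stack]; stack = Most[stack];
    AppendTo[pairs, {open, i}]
    ],
   {i, Length[chars]}];
  (* keep only the pairs that are not enclosed in another matched pair *)
  Select[pairs,
   Function[p, NoneTrue[pairs, #[[1]] < p[[1]] && p[[2]] < #[[2]] &]]]
  ]

str1 = "xx<aa<bbb> <bbb>aa>yy<<dfa>a>";
str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";

topLevelPairs["<", ">", str1]
(* {{3, 19}, {22, 29}} *)

topLevelPairs["(", ")", str2]
(* {{23, 30}} *)

StringTake[str2, #] & /@ topLevelPairs["(", ")", str2]
(* {"(b(x99))"} *)
```

On the balanced input this agrees with f, and on str2 the unmatched ( at position 19 is left on the stack, so the result matches the substring found by the regular expression approach.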