How can I match nested brackets using regex?
Many regex implementations cannot match an arbitrary depth of nesting. However, Perl and PHP (via PCRE) support recursive patterns, and .NET offers balancing groups for the same purpose.
A demo in Perl:
#!/usr/bin/perl -w
my $text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';
while ($text =~ /(\(([^()]|(?R))*\))/g) {
    print("----------\n$1\n");
}
which will print:
----------
(outer
(center
(inner)
(inner)
center)
ouer)
----------
(outer
(inner)
ouer)
----------
(outer
ouer)
Or, the PHP equivalent:
$text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';
preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);
print_r($matches);
which produces:
Array
(
    [0] => Array
        (
            [0] => (outer
(center
(inner)
(inner)
center)
ouer)
            [1] => (outer
(inner)
ouer)
            [2] => (outer
ouer)
        )
...
An explanation:
(              # start group 1
  \(           #   match a literal '('
  (            #   start group 2
    [^()]      #     any char other than '(' and ')'
    |          #     OR
    (?R)       #     recursively match the entire pattern
  )*           #   end group 2 and repeat zero or more times
  \)           #   match a literal ')'
)              # end group 1
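For readers whose regex engine has no `(?R)` (Python's standard `re` module, for instance), the same grammar can be written by hand. The following is only a rough sketch, not part of the original answer: `match_group` and `find_groups` are hypothetical names, and each branch is commented with the piece of the pattern it is meant to mirror.

```python
def match_group(s, i):
    """Return the index just past the balanced group starting at s[i], or -1."""
    if i >= len(s) or s[i] != '(':
        return -1                      # \(    : must start with a literal '('
    i += 1
    while i < len(s):
        if s[i] == '(':                # (?R)  : recurse into a nested group
            i = match_group(s, i)
            if i == -1:
                return -1              # the nested group never closed
        elif s[i] == ')':              # \)    : a literal ')' closes this group
            return i + 1
        else:                          # [^()] : any other character
            i += 1
    return -1                          # input ended before the closing ')'


def find_groups(s):
    """Collect balanced groups left to right, similar to a /g match loop."""
    groups, i = [], 0
    while i < len(s):
        end = match_group(s, i)
        if end == -1:
            i += 1                     # no match here; try the next position
        else:
            groups.append(s[i:end])
            i = end                    # resume scanning after the match
    return groups
```

Under these assumptions, `find_groups('(a(b)c) x (d)')` returns `['(a(b)c)', '(d)']`, and an opening bracket that is never closed is skipped, just as the regex would fail at that position and move on.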
EDIT
Note @Goozak's comment:
A better pattern might be
\(((?>[^()]+)|(?R))*\)
(from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.
Don't use regex.
Instead, a simple recursive function will suffice. Here's the general structure:
def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '(':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == ')':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i
For example, here's a function that will parse the input into a nested list structure:
def parse_to_list(s, i=0):
    result = []
    while i < len(s):
        if s[i] == '(':
            i, r = parse_to_list(s, i+1)
            result.append(r)
        elif s[i] == ')':
            return i+1, result
        else:
            result.append(s[i])
            i += 1
    return i, result
Calling this like parse_to_list('((a) ((b)) ((c)(d)))efg')
returns a tuple of the final index and the parsed structure: (23, [[['a'], ' ', [['b']], ' ', [['c'], ['d']]], 'e', 'f', 'g']).
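One thing to watch: as written, parse_to_list accepts unbalanced input silently. A stray ')' at the top level makes it return early (for 'a)b' it stops after 'a'), and '(a' is treated as if the group had been closed. If that matters for your data, a stricter variant is easy to sketch. The name `parse_strict` and the extra `depth` parameter below are my own additions for illustration, not part of the function above.

```python
def parse_strict(s, i=0, depth=0):
    """Like parse_to_list, but raise ValueError on unbalanced brackets."""
    result = []
    while i < len(s):
        if s[i] == '(':
            # Descend one level and splice the nested list into the result.
            i, nested = parse_strict(s, i + 1, depth + 1)
            result.append(nested)
        elif s[i] == ')':
            if depth == 0:
                # A ')' at the top level has no matching '('.
                raise ValueError(f"unmatched ')' at index {i}")
            return i + 1, result
        else:
            result.append(s[i])
            i += 1
    if depth > 0:
        # Input ended while still inside at least one group.
        raise ValueError("unmatched '(' (input ended inside a group)")
    return i, result
```

For balanced input it behaves like the original, so `parse_strict('(a)b')` returns `(4, [['a'], 'b'])`, while `parse_strict('(a')` and `parse_strict('a)')` both raise ValueError.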