Raku regex: How to know which group was captured at an alternation
There are a few ways to do this, with varying degrees of utility.
One way would be to explicitly tell Raku what you want the numbers to be:
'bar' ~~ m/$1=(foo)|$2=(bar)/;
If you extend the regex, counting will continue at $3.
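To see the explicit numbering in action, here is a minimal sketch. The helper name branch-of is made up for illustration; it only checks which of the two aliases ended up defined.

```raku
# Hypothetical helper: reports which alternation branch matched.
sub branch-of(Str $str) {
    return 'none' unless $str ~~ m/ $1=(foo) | $2=(bar) /;
    # Only the alias used by the matching branch is defined.
    return $1.defined ?? 'first' !! 'second';
}

say branch-of('foo');   # first
say branch-of('bar');   # second
say branch-of('baz');   # none
```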
A less-recommendable way to do this would be to sneak in an extra set of parentheses:
'bar' ~~ m/(foo)|()(bar)/;
Given foo, the match lands in $0 and $1 will be undefined; given bar, the match lands in $1 with $0 being empty (but not undefined). TIMTOWTDI, but this is not a good one ;-)
Another way could be to use a flag:
my $flag;
'bar' ~~ m/(foo {$flag = 'first'} ) | (bar {$flag = 'second'} )/;
The flag will be set based on the match. This can actually be a not-terrible way to do things, especially if your flag is binary and you will have some logic that you'll run over it.
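A small sketch of the flag approach wrapped in a sub, so the flag is fresh on every call. The name flag-for is hypothetical. One thing to keep in mind: a code block inside a regex runs when the engine reaches it, and its side effects are not undone if the engine later backtracks, so this works best when the block sits at the very end of its branch, as here.

```raku
# Hypothetical helper: returns the flag set by whichever branch matched.
sub flag-for(Str $str) {
    my $flag;
    $str ~~ m/ (foo { $flag = 'first' }) | (bar { $flag = 'second' }) /;
    return $flag // 'none';   # neither block ran, so there was no match
}

say flag-for('foo');   # first
say flag-for('bar');   # second
say flag-for('qux');   # none
```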
Another similar way would be to take advantage of the .make/.made mechanism that's normally used in action classes, but can still be used inline too:
'bar' ~~ m/(foo {make 'first'} ) | (bar {make 'second'} )/;
say $0.made; # 'second'
This one is nice if you have a lot of metadata you want to associate with it (but probably overkill for just knowing which one was chosen).
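For the "lot of metadata" case, a sketch might attach a hash instead of a plain string. The keys label and weight are invented for the example; anything you can pass to make would do.

```raku
my $meta;
if 'bar' ~~ m/ (foo { make %( label => 'first',  weight => 1 ) })
             | (bar { make %( label => 'second', weight => 2 ) }) / {
    # .made returns whatever was passed to make inside the matching group
    $meta = $0.made;
}

say $meta<label>;    # second
say $meta<weight>;   # 2
```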
There are a few things that cause the capture index to reset. The | and || alternations are one.
Putting a capture inside of another capture group is another, because the match result is a tree.
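A quick sketch of the nesting case: the inner groups are numbered from zero again, and you reach them through the outer capture.

```raku
my $m = 'ab' ~~ / ( (a) (b) ) /;

say ~$m[0];      # ab  (the outer group, also available as $0)
say ~$m[0][0];   # a   (first inner group, numbered from 0 again)
say ~$m[0][1];   # b   (second inner group)
say ~$0[1];      # b   (same thing via the $0 shorthand)
```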
When Raku was being designed everything was redesigned to be more consistent, more useful, and more powerful. Regexes included.
If you have an alternation something like this:
/ (foo) | (bar) /
You might want to use it like this:
$line ~~ / (foo) | (bar) /;
say %h{ ~$0 };
If the (bar) was $1 instead, you would have to write it something like this:
$line ~~ / (foo) | (bar) /;
say %h{ ~$0 || ~$1 };
It is generally more useful for the capture group numbering to start again from zero.
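Here is that lookup as something you can run. The contents of %h and the sub name lookup are assumptions for the sake of the example.

```raku
my %h = foo => 1, bar => 2;   # made-up data

sub lookup(Str $line) {
    return Nil unless $line ~~ / (foo) | (bar) /;
    # Whichever branch matched, its capture is $0.
    return %h{ ~$0 };
}

say lookup('foo');   # 1
say lookup('bar');   # 2
```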
This also makes it so that a regex is more like a general purpose programming language. (Each “block” is an independent subexpression.)
Now sometimes it might be nice to renumber the capture groups.
/ ^
[ (..) '-' (..) '-' (....) # mm-dd-yyyy
| (..) '-' (....) # mm-yyyy
]
$ /
Notice that the yyyy part is either $2 or $1 depending on whether the dd part is included.
my $day = +$2 ?? +$1 !! 1;
my $month = +$0;
my $year = +$2 || +$1;
We can renumber the yyyy to always be $2.
/ ^
[ (..) '-' (..) '-' (....) # mm-dd-yyyy
| (..) '-' $2 = (....) # mm-yyyy
]
$ /
my $day = +$1 || 1;
my $month = +$0;
my $year = +$2;
Or what if we also need to accept yyyy-mm-dd?
/ ^
[ (..) '-' (..) '-' (....) # mm-dd-yyyy
| (..) '-' $2 = (....) # mm-yyyy
| $2 = (....) '-' $0 = (..) '-' $1 = (..) # yyyy-mm-dd
]
$ /
my $day = +$1 || 1;
my $month = +$0;
my $year = +$2;
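Putting the renumbered regex to work, roughly. This is a sketch: I swapped .. for \d so that non-digits fail to match instead of blowing up on numification, and the names $date and parse-date are made up. It tests $1 for truth rather than using +$1 || 1, which avoids numifying a missing capture.

```raku
my $date = / ^
    [ (\d\d) '-' (\d\d) '-' (\d ** 4)                   # mm-dd-yyyy
    | (\d\d) '-' $2 = (\d ** 4)                         # mm-yyyy
    | $2 = (\d ** 4) '-' $0 = (\d\d) '-' $1 = (\d\d)    # yyyy-mm-dd
    ]
$ /;

# Hypothetical helper: returns (day, month, year), or Nil if nothing matched.
sub parse-date(Str $s) {
    return Nil unless $s ~~ $date;
    my $day = $1 ?? +$1 !! 1;   # $1 is absent in the mm-yyyy branch
    return ($day, +$0, +$2);
}

say parse-date('03-04-2021');   # (4 3 2021)
say parse-date('03-2021');      # (1 3 2021)
say parse-date('2021-03-04');   # (4 3 2021)
```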
Actually, now that we have a lot of capture groups, let's look again at how we would handle it if | didn't cause the numbered capture groups to start again from $0:
/ ^
[ (..) '-' (..) '-' (....) # mm-dd-yyyy
| (..) '-' (....) # mm-yyyy
| (....) '-' (..) '-' (..) # yyyy-mm-dd
]
$ /
my $day = +$1 || +$7 || 1;
my $month = +$0 || +$3 || +$6;
my $year = +$2 || +$4 || +$5;
That is not great.
For one thing, you have to make sure both the regex and the my $day assignments match up correctly.
Quick, without counting capture groups, make sure that those numbers match the correct capture groups.
Of course that still has the issue that concepts which have a name are instead captured by a number.
So we should use names instead.
/ ^
[ $<month> = (..) '-' $<day> = (..) '-' $<year> = (....) # mm-dd-yyyy
| $<month> = (..) '-' $<year> = (....) # mm-yyyy
| $<year> = (....) '-' $<month> = (..) '-' $<day> = (..) # yyyy-mm-dd
]
$ /
my $day = +$<day> || 1;
my $month = +$<month>;
my $year = +$<year>;
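With names, the captures can go more or less straight into something useful. As a sketch, here they feed the built-in Date class; the sub name to-date is made up, and I again use \d instead of . so that only digits match. Note that Date.new will still throw on an impossible date such as month 13.

```raku
# Hypothetical helper: builds a Date from any of the three formats.
sub to-date(Str $s) {
    return Nil unless $s ~~ / ^
        [ $<month> = (\d\d) '-' $<day> = (\d\d) '-' $<year> = (\d ** 4)    # mm-dd-yyyy
        | $<month> = (\d\d) '-' $<year> = (\d ** 4)                        # mm-yyyy
        | $<year> = (\d ** 4) '-' $<month> = (\d\d) '-' $<day> = (\d\d)    # yyyy-mm-dd
        ]
    $ /;
    my $day = $<day> ?? +$<day> !! 1;   # default to the 1st when dd is missing
    return Date.new(+$<year>, +$<month>, $day);
}

say to-date('03-04-2021');   # 2021-03-04
say to-date('03-2021');      # 2021-03-01
say to-date('2021-03-04');   # 2021-03-04
```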
So long story short, I would do this:
/ $<foo> = (foo) | $<bar> = (bar) /;
if $<foo> {
…
} elsif $<bar> {
…
}
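Filled in as a runnable sketch (the sub name which-branch and the return strings are just placeholders for whatever you would do in each branch):

```raku
sub which-branch(Str $str) {
    return 'neither' unless $str ~~ / $<foo> = (foo) | $<bar> = (bar) /;
    if $<foo> {
        return 'foo';
    } elsif $<bar> {
        return 'bar';
    }
}

say which-branch('foo');   # foo
say which-branch('bar');   # bar
say which-branch('baz');   # neither
```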