Raku: effect of capture markers is lost "higher up"

TL;DR Use "multiple dispatch".^[1,2] See @user0721090601's answer for a thorough explanation of why things are as they are. See @p6steve's for a really smart change to your grammar if you want your number syntax to match Raku's.

A multiple dispatch solution

Is there a way around this?

One way is to switch to explicit multiple dispatch.

You currently have a value token which calls specifically named value variants:

    token value { <strvalue> | <numvalue> }

Replace that with:

    proto token value {*}

and then rename the called tokens according to grammar multiple dispatch targeting rules, so the grammar becomes:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value {*}
    token value:str { '"' <( <-["]>* )> '"' }
    token value:num { '-'? \d+ [ '.' \d* ]? }
}

say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');

This displays:

｢foo = 42｣
 keyword => ｢foo｣
 value => ｢42｣
｢bar = "Hello, World!"｣
 keyword => ｢bar｣
 value => ｢Hello, World!｣

This doesn't capture the individual alternations by default. We can stick with "multiple dispatch" but reintroduce naming of the sub-captures:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value { * }
    token value:str { '"' <( $<strvalue>=(<-["]>*) )> '"' }
    token value:num { $<numvalue>=('-'? \d+ [ '.' \d* ]?) }
}

say MyGrammar.parse('foo = 42');
say MyGrammar.parse('bar = "Hello, World!"');

This displays:

｢foo = 42｣
 keyword => ｢foo｣
 value => ｢42｣
  numvalue => ｢42｣
｢bar = "Hello, World!"｣
 keyword => ｢bar｣
 value => ｢Hello, World!｣
  strvalue => ｢Hello, World!｣
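
As a rough sketch of how those named sub-captures might be used afterwards, the snippet below checks which one is present on the value match. The grammar name KeyValue and the helper value-kind are made up for this illustration, and the variants are spelled with the more common :sym<...> form rather than the shorter :str / :num used above.

```raku
# Illustrative grammar (hypothetical name); same shape as the one above.
grammar KeyValue {
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value {*}
    token value:sym<str> { '"' <( $<strvalue>=(<-["]>*) )> '"' }
    token value:sym<num> { $<numvalue>=('-'? \d+ [ '.' \d* ]?) }
}

# Hypothetical helper: report which kind of value was parsed.
sub value-kind(Str $input) {
    my $m = KeyValue.parse($input);
    return 'no match' unless $m;
    # Only the variant that matched leaves its named sub-capture behind.
    $m<value><strvalue>.defined ?? 'string' !! 'number';
}

say value-kind('foo = 42');
say value-kind('bar = "Hello, World!"');
```

The check relies on a missing sub-capture being undefined, so it needs no knowledge of how the variants are implemented.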

Surprises

to my surprise, the quotes are included in value.

I too was initially surprised.^[3]

But the current behaviour also makes sense to me in at least the following senses:

The existing behaviour has merit in some circumstances;
It wouldn't be surprising if I was expecting it, which I think I might well have done in some other circumstances;
It's not easy to see how one would get the current behaviour if it was wanted but instead worked as you (and I) initially expected;
There's a solution, as covered above.

Footnotes

^[1] Use of multiple dispatch^[2] is a solution, but seems overly complex imo given the original problem. Perhaps there's a simpler solution. Perhaps someone will provide it in another answer to your question. If not, I would hope that we one day have at least one much simpler solution. However, I wouldn't be surprised if we don't get one for many years. We have the above solution, and there's plenty else to do.

^[2] While you can declare, say, method value:foo { ... } and write a method (provided each such method returns a match object), I don't think Rakudo uses the usual multiple method dispatch mechanism to dispatch to non-method rule alternations but instead uses an NFA.

^[3] Some might argue that it "should", "could", or "would" "be for the best" if Raku did as we expected. I find I think my best thoughts if I generally avoid [sh|c|w]oulding about bugs/features unless I'm willing to take any and all downsides that others raise into consideration and am willing to help do the work needed to get things done. So I'll just say that I'm currently seeing it as 10% bug, 90% feature, but "could" swing to 100% bug or 100% feature depending on whether I'd want that behaviour or not in a given scenario, and depending on what others think.

The <( and )> capture markers only work within a given token. Basically, each token returns a Match object that says "I matched the original string from index X (.from) to index Y (.to)", which is taken into account when stringifying Match objects. That's what's happening with your strvalue token:

my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;

my $start = $m<value><strvalue>.from;     # 7
my $end   = $m<value><strvalue>.to;       # 20
say $text.substr: $start, $end - $start;  # Hello, World!

You'll notice that there are only two numbers: a start value and a finish value. This means that the value token you have can't create a discontiguous match. So its .from is set to 6, and its .to to 21.
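
To see both ranges side by side, here is a small sketch. The grammar is my reconstruction of the shape described in the question (a value token that calls strvalue and numvalue), and the name OriginalShape is made up, so treat it as an assumption rather than your exact code.

```raku
# Assumed reconstruction of the question's grammar (hypothetical name).
grammar OriginalShape {
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    token value { <strvalue> | <numvalue> }
    token strvalue { '"' <( <-["]>* )> '"' }
    token numvalue { '-'? \d+ [ '.' \d* ]? }
}

my $text = 'bar = "Hello, World!"';
my $m = OriginalShape.parse: $text;

# Print the range and text of the outer match, then of the inner one.
for $m<value>, $m<value><strvalue> -> $match {
    say "{$match.from}..{$match.to}: {$match.Str}";
}
```

The outer value match covers the quotes, while the inner strvalue match is the one narrowed by the capture markers.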

There are two ways around this: by using (a) an actions object or (b) a multitoken. Both have their advantages, and depending on how you want to use this in a larger project, you might want to opt for one or the other.

While you can technically define actions directly within a grammar, it's much easier to do them via a separate class. For your grammar, that might look like this:

class MyActions { 
  method TOP      ($/) { make $<keyword>.made => $<value>.made }
  method keyword  ($/) { make ~$/ }
  method value    ($/) { make ($<numvalue> // $<strvalue>).made }
  method numvalue ($/) { make +$/ }
  method strvalue ($/) { make ~$/ }
}

Each level uses make to pass values up to whatever token includes it, and the enclosing token has access to those values via the .made method. This is really nice when, instead of working with pure string values, you want to process them first in some way and create an object or similar.

To parse, you just do:

my $m = MyGrammar.parse: $text, :actions(MyActions);
say $m.made; # bar => Hello, World!

The result is actually a Pair object. You could change the exact result by modifying the TOP method.
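
As one possible variation, here is a self-contained sketch in which TOP builds a Hash instead of a Pair. The names ConfigLine and ConfigLineActions are invented for the example, and the grammar is again my assumed reconstruction of the one in the question.

```raku
# Assumed reconstruction of the question's grammar (hypothetical name).
grammar ConfigLine {
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    token value { <strvalue> | <numvalue> }
    token strvalue { '"' <( <-["]>* )> '"' }
    token numvalue { '-'? \d+ [ '.' \d* ]? }
}

# Same actions as above, except that TOP wraps the pair in a Hash.
class ConfigLineActions {
    method TOP      ($/) { make %( $<keyword>.made => $<value>.made ) }
    method keyword  ($/) { make ~$/ }
    method value    ($/) { make ($<numvalue> // $<strvalue>).made }
    method numvalue ($/) { make +$/ }   # numify, so 42 becomes an Int
    method strvalue ($/) { make ~$/ }   # quotes already excluded by <( )>
}

my %num = ConfigLine.parse('foo = 42', :actions(ConfigLineActions)).made;
my %str = ConfigLine.parse('bar = "Hello, World!"', :actions(ConfigLineActions)).made;

say %num<foo>;
say %str<bar>;
```

Only the TOP method changed; the lower-level methods still decide what each piece is turned into.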

The second way you can work around things is to use a multi token. It's fairly common in developing grammars to use something akin to

token foo { <option-A> | <option-B> }

But as you can see from the actions class, it requires us to check and see which one was actually matched. Instead, if the alternation can acceptably be done with |, you can use a multitoken:

proto token foo { * }
multi token foo:sym<A> { ... }
multi token foo:sym<B> { ... }

When you use <foo> in your grammar, it will match either of the two multi versions as if it had been in the baseline <foo>. Even better, if you're using an actions class, you can similarly just use $<foo> and know it's there without any conditionals or other checks.

In your case, it would look like this:

grammar MyGrammar
{
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value { * }
    multi token value:sym<str> { '"' <( <-["]>* )> '"' }
    multi token value:sym<num> { '-'? \d+ [ '.' \d* ]? }
}

Now we can access things as you were originally expecting, without using an actions object:

my $text = 'bar = "Hello, World!"';
my $m = MyGrammar.parse: $text;

say $m;        # ｢bar = "Hello, World!"｣
               #  keyword => ｢bar｣
               #  value => ｢Hello, World!｣

say $m<value>; # ｢Hello, World!｣

For reference, you can combine both techniques. Here's how I would now write the actions object given the multi token:

class MyActions { 
  method TOP            ($/) { make $<keyword>.made => $<value>.made }
  method keyword        ($/) { make ~$/ }
  method value:sym<str> ($/) { make ~$/ }
  method value:sym<num> ($/) { make +$/ }
}

Which is a bit more grokkable at first look.
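
Putting the two pieces together, a complete run might look like the sketch below. The names Setting and SettingActions are hypothetical, and I've written the variants as plain token declarations under the proto, which is the spelling I'm most sure of.

```raku
# Hypothetical names; the structure follows the multi token grammar above.
grammar Setting {
    rule TOP { <keyword> '=' <value> }
    token keyword { \w+ }
    proto token value {*}
    token value:sym<str> { '"' <( <-["]>* )> '"' }
    token value:sym<num> { '-'? \d+ [ '.' \d* ]? }
}

# One action method per variant, so no check for which one matched.
class SettingActions {
    method TOP            ($/) { make $<keyword>.made => $<value>.made }
    method keyword        ($/) { make ~$/ }
    method value:sym<str> ($/) { make ~$/ }
    method value:sym<num> ($/) { make +$/ }
}

my $str = Setting.parse('bar = "Hello, World!"', :actions(SettingActions)).made;
my $num = Setting.parse('ratio = -3.5', :actions(SettingActions)).made;

say $str;        # a Pair whose value is a Str
say $num.value;  # numified by the value:sym<num> action
```

Adding another kind of value then means adding one token and one matching method, without touching the existing ones.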

Rather than rolling your own token value:str and token value:num, you may want to use a regex Boolean check for Num (+) and Str (~) matching, as explained to me here and documented here:

token number { \S+ <?{ defined +"$/" }> }
token string { \S+ <?{ defined ~"$/" }> }
