How do I match a hex array in perl6 grammar
If you don't need to use grammars, you can do this:
my $a = "39 3A 3B 9:;";
say $a.split(/\s+/).grep: * ~~ /<< <[0..9 A..F]> ** 2 >>/;
The regex matches these two-digit hex strings. As for your grammar, the problem might be the number of spaces you're using; grammars are very strict about whitespace.
<[abcdef...]> in a P6 regex is a "character class" in the match-one-character sense.1
The idiomatic way to get what you want is to use the ** quantifier:
my $a = "39 3A 3B ";
grammar Hex {
token TOP { <hex_array>+ }
token hex_array { <[0..9 A..F]>**1..2 " " }
};
Hex.parse($a);
The rest of this answer is "bonus" material on why and how to use rules.
You are of course perfectly free to match whitespace situations by including whitespace patterns in arbitrary individual tokens, like you did with " " in your hex_array token.
However, it's good practice to use rules instead when appropriate -- which is most of the time.
First, use ws instead of " ", \s* etc.
Let's remove the space in the second token and move it instead to the first one:
token TOP { [ <hex_array> " " ]+ }
token hex_array { <[0..9 A..F]>**1..2 }
We've added square bracketing ([...]) that combines the hex_array and a space and then applied the + quantifier to that combined atom. It's a simple change, and the grammar continues to work as before, matching the space as before, except now the space won't be captured by the hex_array token.
Next, let's switch to using the built-in ws token:
token TOP { [ <hex_array> <.ws> ]+ }
The default <ws> is more generally useful, in desirable ways, than \s*.2 And if the default ws doesn't do what you need, you can specify your own ws token.
We've used <.ws> instead of <ws> because, like \s*, use of <.ws> avoids additional capture of whitespace that would likely just clutter up the parse tree and waste memory.
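To make the capture difference concrete, here is a small sketch. The grammar names Capturing and NonCapturing are made up for this illustration; the token bodies are the ones shown above, and the only difference between the two grammars is <ws> versus <.ws>:

```raku
# Hypothetical grammar names; only the <ws> vs <.ws> call differs.
grammar Capturing {
    token TOP       { [ <hex_array> <ws> ]+ }
    token hex_array { <[0..9 A..F]> ** 1..2 }
}
grammar NonCapturing {
    token TOP       { [ <hex_array> <.ws> ]+ }
    token hex_array { <[0..9 A..F]> ** 1..2 }
}

my $input = "39 3A 3B ";

# List the named captures recorded at the top level of each parse tree.
say Capturing.parse($input).hash.keys.sort;     # includes a ws entry
say NonCapturing.parse($input).hash.keys.sort;  # only hex_array
```

Both grammars accept the same input; the leading dot only controls whether the whitespace matches end up in the resulting match object.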
One often wants something like <.ws> after almost every token in higher level parsing rules that string tokens together. But writing that out explicitly every time would produce highly repetitive and distracting <.ws> and [ ... <.ws> ] boilerplate. To avoid that, there's a built-in shortcut that implicitly inserts the boilerplate for you. This shortcut is the rule declarator, which in turn uses :sigspace.
Using rule (which uses :sigspace)
A rule is exactly the same as a token except that it switches on :sigspace at the start of the pattern:
rule { <hex_array>+ }
token { :sigspace <hex_array>+ } # exactly the same thing
Without :sigspace (so in tokens and regexs by default), all literal spaces in a pattern (unless you quote them) are ignored. This is generally desirable for readable patterns of individual tokens because they typically specify literal things to match.
But once :sigspace is in effect, spaces after atoms become "significant" -- because they're implicitly converted to <.ws> or [ ... <.ws> ] calls. This is desirable for readable patterns specifying sequences of tokens or subrules because it's a natural way to avoid the clutter of all those extra calls.
The first pattern below will match one or more hex_array tokens with no spaces being matched either between them or at the end. The last two will match one or more hex_arrays, without intervening spaces, and then with or without spaces at the very end:
token TOP {           <hex_array>+ }
#          ^ ignored ^            ^ ignored
token TOP { :sigspace <hex_array>+ }
#          ^ ignored ^            ^ significant
rule  TOP {           <hex_array>+ }
#          ^ ignored ^            ^ significant
NB. Adverbs (like :sigspace) aren't atoms. Spaces immediately before the first atom (in the above, spaces before <hex_array>) are never significant (regardless of whether :sigspace is or isn't in effect). But thereafter, if :sigspace is in effect, all non-quoted spacing in the pattern is "significant" -- that is, it's converted to <.ws> or [ ... <.ws> ].
In the above code, the second token and the rule would match one or more contiguous hex_array tokens followed by optional spaces at the very end, because the space immediately after the + and before the } means the pattern is rewritten to:
token TOP { <hex_array>+ <.ws> }
But this rewritten token won't match if your input has multiple hex_array tokens with one or more spaces between them. Instead you would want to write:
rule TOP { <hex_array> + }
# ignored ^           ^ ^ both these spaces are significant
which is rewritten to:
token TOP { [ <hex_array> <.ws> ]+ <.ws> }
This will match your input.
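A quick way to see the difference is to run both spellings against the input from the question. This is only a sketch: the grammar names Tight and Loose are invented for the comparison, and the hex_array token is the space-free version from earlier.

```raku
# Hypothetical names: Tight has no space before the +, Loose has one.
grammar Tight {
    rule  TOP       { <hex_array>+ }
    token hex_array { <[0..9 A..F]> ** 1..2 }
}
grammar Loose {
    rule  TOP       { <hex_array> + }
    token hex_array { <[0..9 A..F]> ** 1..2 }
}

my $input = "39 3A 3B ";

# Tight only allows whitespace after the whole run of hex_arrays,
# so it cannot get past the first space and the parse fails.
say so Tight.parse($input);

# Loose allows whitespace after every hex_array, so the parse succeeds.
say so Loose.parse($input);
```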
Conclusion
So, after all that apparent complexity, which is really just me being exhaustively precise, I'm suggesting you might write your original code as:
my $a = "39 3A 3B ";
grammar Hex {
rule TOP { <hex_array> + }
token hex_array { <[0..9 A..F]>**1..2 }
};
Hex.parse($a);
and this would match more flexibly than your original (I'm presuming that would be a good thing though of course it might not be for some use cases) and would perhaps be easier to read for most P6ers.
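As a rough illustration of that extra flexibility, the sketch below feeds the suggested grammar an input with irregular whitespace and converts the captures to integers. The sample string and the use of parse-base are my own additions, not part of the question:

```raku
grammar Hex {
    rule  TOP       { <hex_array> + }
    token hex_array { <[0..9 A..F]> ** 1..2 }
}

# Assumed sample input: two spaces, then a newline, and no trailing space.
my $m = Hex.parse("39  3A\n3B");

# Each captured hex string converted to an integer via Str.parse-base.
say $m<hex_array>.map({ .Str.parse-base(16) });
```

Note that leading whitespace in the input would still make the parse fail, because spaces before the first atom of a rule are not significant.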
Finally, to reinforce how to avoid two of the three gotchas of rules, see also What's the best way to be lax on whitespace in a perl6 grammar?. (The third gotcha is whether you need to put a space between an atom and a quantifier, as with the space between the <hex_array> and the + in the above.)
Footnotes
1 If you want to match multiple characters, then append a suitable quantifier to the character class. This is a sensible way for things to be, and the assumed behavior of a "character class" according to Wikipedia. Unfortunately the P6 doc currently confuses the issue, e.g. by lumping together both genuine character classes and other rules that match multiple characters under the heading Predefined character classes.
2 The default ws rule is designed to match between words, where a "word" is a contiguous sequence of letters (Unicode category L), digits (Nd), or underscores. In code, it's specified as:
regex ws { <!ww> \s* }
ww is a "within word" test. So <!ww> means not within a "word". <ws> will always succeed where \s* would -- except that, unlike \s*, it won't succeed in the middle of a word. (Like any other atom quantified with a *, a plain \s* will always match because it matches any number of spaces, including none at all.)
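The difference between \s* and <.ws> can be checked with a few one-off matches. This is a minimal sketch using made-up sample strings; it relies on <.ws> being callable from a plain regex outside a grammar:

```raku
# Two hex pairs with nothing between them: \s* matches zero spaces.
say so "3A3B"  ~~ / ^ <[0..9 A..F]> ** 2 \s*   <[0..9 A..F]> ** 2 $ /;

# <.ws> refuses to match here, because the position is inside a "word".
say so "3A3B"  ~~ / ^ <[0..9 A..F]> ** 2 <.ws> <[0..9 A..F]> ** 2 $ /;

# With a space between the pairs, <.ws> matches just like \s* would.
say so "3A 3B" ~~ / ^ <[0..9 A..F]> ** 2 <.ws> <[0..9 A..F]> ** 2 $ /;
```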