Unable to write a grammar in perl6 for parsing lines with special characters
Note that there is a hidden space at the end of the last line in the $t string:
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf $1234.00
qwe {rq} [we-r_q] we
Start Invoice Details␣ <-- Space at the end of the line
EOQ
This makes the <invoice-prelude-end> token fail, since it contains the lookahead regex <?before 'Start Invoice Details'\n>. This lookahead does not allow for a possible space at the end of the line (because of the explicit newline character \n at the end of the lookahead). Hence, the <invoice-prelude> rule cannot match either.
A quick fix is to remove the space at the end of the line Start Invoice Details.
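To see the effect in isolation, here is a minimal sketch that applies the same lookahead to two hypothetical one-line inputs (the variable names are made up for this illustration and are not part of the original grammar):

```raku
# Hypothetical inputs: the marker line without and with a trailing space.
my $clean    = "Start Invoice Details\n";
my $trailing = "Start Invoice Details \n";

# The lookahead from the question: the newline must directly follow the text.
my $strict = / <?before 'Start Invoice Details' \n> /;

say so $clean    ~~ $strict;   # True
say so $trailing ~~ $strict;   # False, the space sits between the text and \n
```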
First, without backtracking, the frugal quantifier *? probably matches the empty string every time. You can use regex instead of rule.
Second, there is a space at the end of the line that starts with Start Invoice Details.
rule invoice-prelude-end {<line> <?before 'Start Invoice Details' \n>};
regex invoice-prelude {
<invoice-prelude-start>
<line>*?
<invoice-prelude-end>
<line>
}
If you want to avoid backtracking, you can use a negative lookahead.
token invoice-prelude-end { <line> };
rule invoice-prelude {
<invoice-prelude-start>
[<line> <!before 'Start Invoice Details' \n>]*
<invoice-prelude-end>
<line>
}
Whole example with some changes as inspiration:
use v6;
#use Grammar::Tracer;
grammar invoice {
token ws { <!ww>\h* }
token super-word { \S+ }
token line { <super-word>* % <.ws> }
token invoice-prelude-start { 'Invoice Summary' }
rule invoice-prelude-midline { <line> <!before \n <invoice-details-start> \n> }
token invoice-prelude-end { <line> }
token invoice-details-start { 'Start Invoice Details' }
rule invoice-prelude {
<invoice-prelude-start> \n
<invoice-prelude-midline> * %% \n
<invoice-prelude-end> \n
<invoice-details-start> \n
}
}
multi sub MAIN(){
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf $1234.00
qwe {rq} [we-r_q] we
Start Invoice Details
EOQ
say $t;
say invoice.parse($t,:rule<invoice-prelude>);
}
TLDR: The issue is that the test input line with Start Invoice Details ends with horizontal whitespace that you aren't dealing with.
Two ways to deal with it (other than changing the input):
# Explicitly: vvv
token invoice-prelude-end { <line> <?before 'Start Invoice Details' \h* \n>}
# Implicitly:
rule invoice-prelude-end { <line><?before 'Start Invoice Details' \n>}
# ^ must be a rule and there must be a space ^
# (uses the fact that you wrote your own <ws> token)
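As a quick check of the explicit variant, the following sketch runs the \h*-tolerant lookahead against two hypothetical inputs; the variable names are invented for this example, and only the lookahead itself comes from the code above:

```raku
# Hypothetical inputs: the marker line with and without a trailing space.
my $with-space    = "Start Invoice Details \n";
my $without-space = "Start Invoice Details\n";

# \h* lets any amount of horizontal whitespace sit before the newline.
my $tolerant = / <?before 'Start Invoice Details' \h* \n> /;

say so $with-space    ~~ $tolerant;   # True
say so $without-space ~~ $tolerant;   # True
```

Because \h* also matches zero characters, the same lookahead keeps working if the trailing space is later removed from the input.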
Following are some more things that I think would be helpful.
I would have used the “separated by” feature % in line and super-phrase:
token super-phrase { <super-word>+ % \h } # single % doesn't capture trailing separator
token line {
^^ \h*
<super-word>* %% \h+ # double %% can capture optional trailing separator
\n
}
Those are [almost] exactly equivalent to what you wrote. (What you wrote has to fail to match <super-word> twice in <line>, but this only has to fail once.)
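The difference between % and %% can be seen in a small standalone sketch. The two patterns below are simplified stand-ins for illustration, not the grammar's actual tokens; they are anchored with ^ and $ so that a trailing separator decides the outcome:

```raku
# Simplified stand-ins: words (\S+) separated by horizontal whitespace.
my $single = / ^ [ \S+ ]+ %  \h+ $ /;   # %  : separators only between items
my $double = / ^ [ \S+ ]+ %% \h+ $ /;   # %% : also allows one trailing separator

say so 'asd fasdf'  ~~ $single;   # True
say so 'asd fasdf ' ~~ $single;   # False, the trailing space is left over
say so 'asd fasdf ' ~~ $double;   # True
```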
I would have used the surround feature ~ in invoice-prelude:
token invoice-prelude {
# zero or more <line>s surrounded by <invoice-prelude-start> and <invoice-prelude-end>
<invoice-prelude-start> ~ <invoice-prelude-end> <line>*?
<line> # I assume this is here for debugging
}
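For readers who have not met ~ before, here is a minimal sketch on an unrelated, hypothetical input. It shows only the ordering that ~ implies: the opener comes first, the closer is written second, and the content is written last but matched in between.

```raku
# '[' ~ ']' \w+ matches the same text as '[' \w+ ']'.
my $bracketed = / ^ '[' ~ ']' \w+ $ /;

say so '[abc]' ~~ $bracketed;   # True
```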
Note that invoice-prelude didn't actually gain anything by being a rule, because all of the horizontal whitespace is already handled by the rest of the code.
I don't think that the last line of the invoice prelude is special, so remove <line> from invoice-prelude-end. (<line>*? in invoice-prelude will capture it instead.)
token invoice-prelude-end {<?before 'Start Invoice Details' \h* \n>}
The only regexes that could benefit from being a rule are invoice-prelude-start and invoice-prelude-end.
rule invoice-prelude-start {^^ Invoice Summary \n}
# `^^` is needed so the space ^ will match <.ws>
rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>}
That would only work if you are fine with it also matching something like Invoice Summary with extra horizontal whitespace in it.
Note that invoice-prelude-start needs to use \n to capture the newline, but invoice-prelude-end can use $$ instead because it isn't capturing \n anyway.
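The distinction is that \n consumes the newline while $$ only asserts that the current position is at the end of a line. A small sketch on a hypothetical two-line string (the names are made up for this illustration):

```raku
# Hypothetical two-line input.
my $text = "abc\ndef";

my $with-newline = $text ~~ / abc \n /;   # \n is part of the match
my $with-anchor  = $text ~~ / abc $$ /;   # $$ is zero-width

say $with-newline.Str.chars;   # 4
say $with-anchor.Str.chars;    # 3
```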
If you change super-word to something other than \S+, then you may also want to change ws to something like \h+ | <.wb> (where <.wb> matches a word boundary).
#! /usr/bin/env perl6
use v6.d;
grammar invoice {
token TOP { # testing
<invoice-prelude>
<line>
}
token ws { \h* | <.wb> };
token super-word { \S+ };
token super-phrase { <super-word>+ % \h }
token line {
^^ \h*
<super-word>* %% \h+
\n
};
rule invoice-prelude-start {^^ Invoice Summary \n}
rule invoice-prelude-end {<?before ^^ Start Invoice Details $$>};
token invoice-prelude {
<invoice-prelude-start> ~ <invoice-prelude-end>
<line>*?
}
}
multi sub MAIN(){
my $t = q :to/EOQ/;
Invoice Summary
asd fasdf
asdfasdf
asd 123-fasdf $1234.00
qwe {rq} [we-r_q] we
Start Invoice Details
EOQ
say $t;
say invoice.parse($t);
}