How does backtracking affect the language recognized by a parser?
The problem isn't the fact that this is a backtracking or recursive descent parser; the problem is that the described implementation does not properly consider the outer context of the recursive descent parse. This is similar to the difference between a Strong LL (SLL) parser and an LL parser.
The shortest input for which the strange behavior is demonstrated is aaaaaa
.
- We start in rule
S
, and match the 1sta
. - We invoke
S
.- We match the 2nd
a
. - We invoke
S
. I'll omit the specific steps, but the key is this invocation ofS
matchesaaaa
, which is the 3rda
through the end of the input. (See note that follows.) - We try to match
a
, but since the end of the input was already reached, we go back and match just the 2nd through 3rdaa
.
- We match the 2nd
- We match the 4th
a
.
Additional note about the inner call to S
that matched aaaa
: If we knew to reserve an a
at the end of the input for step 3, then the inner call to S
could have matched aa
instead of aaaa
, leading to a successful parse of the complete input aaaaaa
. ANTLR 4 provides this "full context" parsing ability in a recursive descent parser, and is the first recursive descent LL parser able to correctly match aa
instead of aaaa
for this nested invocation of S
.
An SLL parser matches a2k for this grammar. A proper LL parser (such as ANTLR 4) matches a2k for this grammar.
Even with backtracking, which requires being able to rewind the input stream, a recursive descent parser is not allowed to look ahead to the end of the input, nor is it allowed to remove symbols from both ends of the stream.
A left-to-right parser must be able to work with an input stream which has only one method:
get() : consume and read one symbol, or return an EOF symbol.
The backtracking version needs a stream with two more methods:
posn = tell() : return an opaque value which can be used in seek()
seek(posn) : reposition the stream to a previous position returned by tell()