PEG for Python style indentation

I think an indentation-sensitive language like that is context-sensitive. I believe PEG can only do context-free langauges.

Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (ie the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions) but parent rules can fail thus causing the action run to be invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45

So what we are really doing here with indentation is creating something like a C-style blocks which often have their own lexical scope. If I were writing a compiler for a language like that I think I would try and have the lexer keep track of the indentation. Every time the indentation increases it could insert a '{' token. Likewise every time it decreases it could inset an '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straight forward.

Pure PEG cannot parse indentation.

But peg.js can.

I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.

For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js

/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0])
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}

depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.

peg.js produces for the text:

alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota

these results:

[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]

This tokenizer even catches bad indents.

PEG for Python style indentation

Tags:

Syntax

Language Design

Parsing

Treetop

Peg

Related

Recent Posts