Which tool to use to parse programming languages in Python?
ANTLR is what you should look at: http://www.antlr.org
Take a look at its Python target for ANTLR 3: http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget
I really like pyPEG. Its error reporting isn't very friendly, but it can add source code locations to the AST.
pyPEG doesn't have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I've used pyPEG to build a parser for a subset of C# with surprisingly little work.
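To illustrate the indent/dedent point: the standard library's tokenize module, which mirrors CPython's tokenizer, emits explicit INDENT and DEDENT tokens, so a parser sitting on top of it never has to look at whitespace. A minimal sketch (the sample source string is made up for illustration):

```python
import io
import tokenize

# A made-up snippet with one indented block.
source = "if x:\n    y = 1\nz = 2\n"

# generate_tokens() takes a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Map numeric token types to their names and keep only the layout tokens.
names = [tokenize.tok_name[tok.type] for tok in tokens]
layout = [name for name in names if name in ("INDENT", "DEDENT")]

print(layout)  # ['INDENT', 'DEDENT']
```

A lexerless parser would have to track the indentation stack inside the grammar itself, which is why this is awkward in a pure PEG tool.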
An example adapted from fdik.org/pyPEG/: A simple language like this:
function fak(n) {
    if (n==0) { // 0! is 1 by definition
        return 1;
    } else {
        return n * fak(n - 1);
    };
}
A pyPEG parser for that language:
import re
from pyPEG import keyword

# In pyPEG a list means "one of these alternatives" and a tuple means "a sequence".
# An integer before an item sets its repetition: -1 is zero or more, -2 is one or more.
def comment():          return [re.compile(r"//.*"),
                                re.compile(r"/\*.*?\*/", re.S)]
def literal():          return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol():           return re.compile(r"\w+")
def operator():         return re.compile(r"\+|\-|\*|\/|\=\=")
def operation():        return symbol, operator, [literal, functioncall]
def expression():       return [literal, operation, functioncall]
def expressionlist():   return expression, -1, (",", expression)
def returnstatement():  return keyword("return"), expression
def ifstatement():      return (keyword("if"), "(", expression, ")", block,
                                keyword("else"), block)
def statement():        return [ifstatement, returnstatement], ";"
def block():            return "{", -2, statement, "}"
def parameterlist():    return "(", symbol, -1, (",", symbol), ")"
def functioncall():     return symbol, "(", expressionlist, ")"
def function():         return keyword("function"), symbol, parameterlist, block
def simpleLanguage():   return function
pyPEG (a tool I authored) has a tracing facility for error reporting.
Just set pyPEG.print_trace = True
and pyPEG will give you a full trace of what's happening inside.
I would recommend that you check out my library: https://github.com/erezsh/lark
It can parse ALL context-free grammars, automatically builds an AST (with line and column numbers), and accepts the grammar in EBNF, which is considered the standard notation.
It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.