How to parse a clojure expression?
This answer is based on the Mathematica functional parsers package FunctionalParsers.m used and described in these blog posts.
This command loads that package:
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/FunctionalParsers.m"]
Using direct parser definitions (first answer)
Consider the EBNF:
<space> = { 'WhitespaceCharacter' } ;
<var> = { 'WordCharacter' } ;
<number> = { 'DigitCharacter' } ;
<atom> = <var> | <number> ;
<vector> = '[' , <atom> , { <space> , <atom> } , ']' ;
<arg> = <atom> | <vec> | <tree> ;
<args-list> = <arg> , { <space> , <arg> } ;
<head> = <var> ;
<tree> = '(' , <head> , [ <space> , <args-list> ] , ')' ;
Define the corresponding parsers:
pSPACE = ParsePredicate[StringMatchQ[#, WhitespaceCharacter ..] &];
pVAR = var[StringJoin[#]] &\[CircleDot]ParseMany1[
ParsePredicate[
StringMatchQ[#,
Except[{"(", ")", "[", "]", WhitespaceCharacter}] ..] &]];
pNUMBER =
num[ToExpression[StringJoin[#]]] &\[CircleDot]ParseMany1[
ParsePredicate[StringMatchQ[#, DigitCharacter ..] &]];
pATOM = ParseSpaces[pNUMBER\[CirclePlus]pVAR];
pVECTOR =
vec @@ # &\[CircleDot]ParseShortest[
ParseBracketed[ParseListOf[pATOM, pSPACE]]];
pARGSLIST =
ParseShortest[
ParseListOf[
ParseSpaces[pATOM\[CirclePlus]pVECTOR\[CirclePlus]pTREE], pSPACE]];
pHEAD = hd @@ # &\[CircleDot]pVAR;
ParseRecursiveDefinition[pTREE,
clj @@ Join[{#[[1]]}, #[[2]]] &\[CircleDot]ParseParenthesized[
pHEAD\[CircleTimes]ParseOption[
pSPACE \[RightTriangle] pARGSLIST]]];
Note that pVAR
is defined for a larger set of strings than <var>
in the EBNF. Here is a prettier view on that code:
\[CirclePlus]
, \[CircleTimes]
, and \[CircleDot]
correspond to the parsers ParseAlternativeComposition
, ParseSequentialComposition
, and ParseApply
respectively. A triangle points to the parser the output of which is picked up in a sequential composition.
Parse the example in the question:
res = pTREE[Characters["(defn double-int [i] (* i 2))"]]
(* {{{}, clj[hd["defn"], {var["double-int"], vec[var["i"]],
clj[hd["*"], {var["i"], num[2]}]}]}} *)
and plot the result:
TreeForm[res[[1, 2]]]
Here are other parsing examples:
In[164]:= pTREE[Characters["(defn myfun [i j k] (list (* i 2) (+ j 5) (sq k)))"]]
Out[164]= {{{}, clj[hd["defn"], {var["myfun"], vec[var["i"], var["j"], var["k"]],
clj[hd["list"], {clj[hd["*"], {var["i"], num[2]}],
clj[hd["+"], {var["j"], num[5]}], clj[hd["sq"], {var["k"]}]}]}]}}
In[165]:= pTREE[Characters["(defn myfun [i j k] (list))"]]
Out[165]= {{{}, clj[hd["defn"], {var["myfun"], vec[var["i"], var["j"], var["k"]],
clj[hd["list"]]}]}}
Using parser generation from EBNF (update)
Instead of defining the parsers for each of rule of the EBNF grammar we can use the FunctionalParsers.m package to generate the parsers from the EBNF grammar.
ebnfCode = "
<tree> = '(' &> <head>, [ <args-list> ] <& ')' <@ clj@@#& ;
<head> = <var> <@ hd@@#& ;
<args-list> = { <arg> } ;
<arg> = <atom> | <vector> | <tree>;
<atom> = <var>|<number>;
<var> = '_WordString' <@ var ;
<number> = '_?NumberQ' <@ num ;
<vector> = '[' &> { <atom> } <& ']' <@ vec@@#& ;
";
The strings "&>" and "<&" stand for "parse sequentially and pick right" and "parse sequentially and pick left" respectively. The string "<@" stands for "apply to the parsing result of #1 the modifier function #2".
This command generates the parsers:
GenerateParsersFromEBNF[ParseToEBNFTokens[ebnfCode]];
For each EBNF rule a separate parser is made. E.g. for <vector> = ...
the parser pVECTOR
is made.
The generated definition of the parser pVAR
has to be extended in order to allow the symbols "-", "_", etc. to be in the variable names.
pVAR = var\[CircleDot]ParsePredicate[
StringMatchQ[#, Except[{"(", ")", "[", "]"}] ..] &];
Let us define a dedicated tokenizer for the considered Clojure expressions:
ToClojureTokens = ParseToTokens[#, {"(", ")", "[", "]"}, {" ", "\n"}] &;
With the tokenizer we can make the following table of parsing results:
statements = {"[7 34 5 72]", "(ar [7 34 5 72])", "(m 3 (p 4 6))",
"(defn doubleint [i] (* i 2))", "(list (* i 2) (+ j 5) (sq k))",
"(defn myfun [i j k] (list (* i 2) (+ j 5) (sq k)))"};
ParsingTestTable[pTREE, statements,
"TokenizerFunction" -> ToClojureTokens, "Layout" -> "Vertical"]
Here are the tree forms of the parsed expressions in the table:
Map[TreeForm@*First@*Reverse@*First@*pTREE@*ToClojureTokens, statements]
Other discussions
The functional parsers approach has been discussed also in answers of other MSE questions.
"CSS Selectors for Symbolic XML" ;
"General strategies to write big code in Mathematica?"
I have a solution which is brittle only in that the Clojure expressions "(", ")" could mess up bracket detection and that I rely on strings like "<4>" to keep track of things. It's not very brittle once string handling is properly treated. Don't know if it's any better than yours, but it was fun to write and I think it's rather slick, modulo aforementioned fixable brittleness. Special mention to the use of a closure varname
to count the leafs.
parse
just takes a depth-1 Clojure expression like(head arg1 arg2 arg3)
and returns"head"["arg1", "arg2", "arg3"]
.prepare
replaces[contents]
with(List contents)
in the string.splitTree
breaks out the expression into{top-expression, {{label -> sub-expression, label -> sub-expression, …}}}
; see example below.
parse[expr_String?(StringFreeQ[StringTake[#, 2 ;; -2], "("] &)] :=
(#[[1]] @@ Rest[#]) &@
StringSplit[StringTake[expr, 2 ;; -2], " "]
varname[start_: 0] :=
Module[{p = start}, (p++; "<" <> ToString[p] <> ">") &]
prepare[str_] :=
(
newvars = varname[];
StringReplace[str,
Shortest["[" ~~ x__ ~~ "]"] :> "(List " <> x <> ")"];
)
splitTree[str_] :=
Reap@FixedPoint[
StringReplace[#,
x : r__ ~~ int : Shortest["(" ~~ __ ~~ ")"] ~~ s__ :>
(With[{var = newvars[]}, Sow[var -> int];
r <> var <> s])] &, str]
Example usage:
prepare["(defn double-int [i] (* i (+ 3 4)))"]
outputs
"(defn double-int (List i) (* i (+ 3 4)))"
then
output = MapAt[parse, splitTree[%], {2, 1, All, 2}]
outputs
{"(defn double-int <3> <2>)",
{{"<1>" -> "+"["3", "4"],
"<2>" -> "*"["i", "<1>"],
"<3>" -> "List"["i"]}}}
then
parse[output[[1]]] //. output[[2, 1]]
outputs
"defn"["double-int", "List"["i"], "*"["i", "+"["3", "4"]]]