Here is a simple way to tokenize Mathematica input: put the expression's InputForm boxes into an invisible notebook, read the boxes back, and flatten the RowBox structure into a list of token strings.
tokenize[str_] := Module[{exp,
   (* put the parsed expression, in InputForm, into an invisible notebook *)
   nb = CreateDocument[
     {ExpressionCell@InputForm@MakeExpression[str, StandardForm]},
     Visible -> False]},
  SelectionMove[nb, Next, Cell];
  (* read the boxes back, flatten the RowBox structure, and drop whitespace strings *)
  exp = Flatten[
    NotebookRead[nb][[1, 1]] /. {RowBox -> List,
      i_String /; StringMatchQ[i, Whitespace ..] :> Sequence[]}];
  NotebookClose[nb];
  (* drop the surrounding "HoldComplete", "[" and "]" that MakeExpression adds *)
  exp[[3 ;; -2]]
  ]
I haven't tested this much. Does it give the output you expect?
tokenize["Plot3D[{x^2+y^2,-x^2-y^2},{x,-2,2},{y,-2,2},\
RegionFunction->Function[{x,y,z},x^2+y^2<=4]]"]
(*{"Plot3D","[","{","x","^","2","+","y","^","2",",","-","x","^","2","-\
","y","^","2","}",",","{","x",",","-","2",",","2","}",",","{","y",",",\
"-","2",",","2","}",",","RegionFunction","->","Function","[","{","x",\
",","y",",","z","}",",","x","^","2","+","y","^","2","<=","4","]","]",\
"]"}*)
EDIT
Thanks to @JohnFultz's recent introduction of the following undocumented front end function, this becomes straightforward:
fultzTokenize[t_String] :=
 Cases[
  (* call the front end's parser on t and collect every string at any depth *)
  MathLink`CallFrontEnd[FrontEnd`UndocumentedTestFEParserPacket[t, False]],
  _String, Infinity]
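For example (a sketch; the details of what the undocumented packet returns, and hence the exact strings, may differ between front end versions):

fultzTokenize["Plot3D[{x^2+y^2,-x^2-y^2},{x,-2,2},{y,-2,2},RegionFunction->Function[{x,y,z},x^2+y^2<=4]]"]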
I am a developer at Wolfram Research, and I would like to share some of the work I have been doing on parsing WL code.
I have written a package for parsing WL code that retains useful metadata, such as file and line information.
It also exposes a tokenization function.
The paclet is available on the public paclet server:
In[1]:= PacletUpdate["AST","Site"->"http://pacletserver.wolfram.com","UpdateSites"->True]
Out[1]= Paclet[AST,0.8.1,<>]
Load the AST package:
Needs["AST`"]
The AST package has a function TokenizeString
that returns a list of tokens when it is given WL input:
In[2]:= TokenizeString["Plot3D[{x^2+y^2,-x^2-y^2},{x,-2,2},{y,-2,2},RegionFunction->Function[{x,y,z},x^2+y^2<=4]]"]
Out[2]= {Token[Token`Symbol,Plot3D,<|Source->{{1,1},{1,6}}|>],
Token[Token`Operator`OpenSquare,[,<|Source->{{1,7},{1,7}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,8},{1,8}}|>],
Token[Token`Symbol,x,<|Source->{{1,9},{1,9}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,10},{1,10}}|>],
Token[Token`Number,2,<|Source->{{1,11},{1,11}}|>],
Token[Token`Operator`Plus,+,<|Source->{{1,12},{1,12}}|>],
Token[Token`Symbol,y,<|Source->{{1,13},{1,13}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,14},{1,14}}|>],
Token[Token`Number,2,<|Source->{{1,15},{1,15}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,16},{1,16}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,17},{1,17}}|>],
Token[Token`Symbol,x,<|Source->{{1,18},{1,18}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,19},{1,19}}|>],
Token[Token`Number,2,<|Source->{{1,20},{1,20}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,21},{1,21}}|>],
Token[Token`Symbol,y,<|Source->{{1,22},{1,22}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,23},{1,23}}|>],
Token[Token`Number,2,<|Source->{{1,24},{1,24}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,25},{1,25}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,26},{1,26}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,27},{1,27}}|>],
Token[Token`Symbol,x,<|Source->{{1,28},{1,28}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,29},{1,29}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,30},{1,30}}|>],
Token[Token`Number,2,<|Source->{{1,31},{1,31}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,32},{1,32}}|>],
Token[Token`Number,2,<|Source->{{1,33},{1,33}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,34},{1,34}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,35},{1,35}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,36},{1,36}}|>],
Token[Token`Symbol,y,<|Source->{{1,37},{1,37}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,38},{1,38}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,39},{1,39}}|>],
Token[Token`Number,2,<|Source->{{1,40},{1,40}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,41},{1,41}}|>],
Token[Token`Number,2,<|Source->{{1,42},{1,42}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,43},{1,43}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,44},{1,44}}|>],
Token[Token`Symbol,RegionFunction,<|Source->{{1,45},{1,58}}|>],
Token[Token`Operator`MinusGreater,->,<|Source->{{1,59},{1,60}}|>],
Token[Token`Symbol,Function,<|Source->{{1,61},{1,68}}|>],
Token[Token`Operator`OpenSquare,[,<|Source->{{1,69},{1,69}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,70},{1,70}}|>],
Token[Token`Symbol,x,<|Source->{{1,71},{1,71}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,72},{1,72}}|>],
Token[Token`Symbol,y,<|Source->{{1,73},{1,73}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,74},{1,74}}|>],
Token[Token`Symbol,z,<|Source->{{1,75},{1,75}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,76},{1,76}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,77},{1,77}}|>],
Token[Token`Symbol,x,<|Source->{{1,78},{1,78}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,79},{1,79}}|>],
Token[Token`Number,2,<|Source->{{1,80},{1,80}}|>],
Token[Token`Operator`Plus,+,<|Source->{{1,81},{1,81}}|>],
Token[Token`Symbol,y,<|Source->{{1,82},{1,82}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,83},{1,83}}|>],
Token[Token`Number,2,<|Source->{{1,84},{1,84}}|>],
Token[Token`Operator`LessEqual,<=,<|Source->{{1,85},{1,86}}|>],
Token[Token`Number,4,<|Source->{{1,87},{1,87}}|>],
Token[Token`Operator`CloseSquare,],<|Source->{{1,88},{1,88}}|>],
Token[Token`Operator`CloseSquare,],<|Source->{{1,89},{1,89}}|>],
Token[Token`EOF,,<|Source->{{2,0},{2,0}}|>]}
The AST paclet is under development and the format of the output may change, but hopefully this can help.
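If all you need is the flat list of token strings from the question, the Token expressions can be post-processed. A minimal sketch, assuming the Token[type, string, metadata] structure shown above and excluding the end-of-file token:

Cases[
 TokenizeString["x^2+y^2<=4"],
 Token[Except[Token`EOF], s_String, _] :> s]

(* expected: {"x", "^", "2", "+", "y", "^", "2", "<=", "4"} *)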
This, with a suitable transform function to traverse the tree, would be an adequate tokenizer:
TreeForm[Hold[
Plot3D[{x^2 + y^2, -x^2 - y^2}, {x, -2, 2}, {y, -2, 2},
RegionFunction -> Function[{x, y, z}, x^2 + y^2 <= 4]]]]
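For instance, one such transform could simply collect the heads and atoms of the held expression in depth-first order. This is only a sketch: it yields expression-tree tokens, so Power appears instead of "^" and the brackets are not recovered:

treeTokens[h_Hold] := Level[h, {-1}, Heads -> True]

treeTokens[Hold[x^2 + y^2 <= 4]]

(* expected: {Hold, LessEqual, Plus, Power, x, 2, Power, y, 2, 4},
   assuming x and y have no values assigned *)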