Tokenize Mathematica input in a simple way

tokenize[str_] := Module[{exp,
   nb = CreateDocument[{ExpressionCell@
       InputForm@MakeExpression[str, StandardForm]}, 
     Visible -> False]},
  SelectionMove[nb, Next, Cell];
  exp = Flatten[
    NotebookRead[nb][[1, 1]] /. {RowBox -> List, 
      i_String /; StringMatchQ[i, Whitespace ..] :> Sequence[]}];
  NotebookClose[nb];
  exp[[3 ;;-2]]
  ]

Haven't tested this much. Does this give the output you expect?

tokenize["Plot3D[{x^2+y^2,-x^2-y^2},{x,-2,2},{y,-2,2},\
RegionFunction->Function[{x,y,z},x^2+y^2<=4]]"]

(*{"Plot3D","[","{","x","^","2","+","y","^","2",",","-","x","^","2","-\
","y","^","2","}",",","{","x",",","-","2",",","2","}",",","{","y",",",\
"-","2",",","2","}",",","RegionFunction","->","Function","[","{","x",\
",","y",",","z","}",",","x","^","2","+","y","^","2","<=","4","]","]",\
"]"}*)

EDIT

Thanks to @JohnFultz's recent introduction of the following undocumented front end function, this becomes straightforward:

 fultzTokenize[t_String]:=Cases[MathLink`CallFrontEnd[
   FrontEnd`UndocumentedTestFEParserPacket[t, False]], _String, Infinity]

I am a developer at Wolfram Research and I am trying to share some of the work I have been doing with parsing WL code.

I have written a package for parsing WL code and retaining interesting metadata, such as file and line information.

I also expose a tokenization function.

The paclet is available on the public paclet server:

In[1]:= PacletUpdate["AST","Site"->"http://pacletserver.wolfram.com","UpdateSites"->True]
Out[1]= Paclet[AST,0.8.1,<>]

Load the AST package:

Needs["AST`"]

The AST package has a function TokenizeString that returns a list of tokens when it is given WL input:

In[2]:= TokenizeString["Plot3D[{x^2+y^2,-x^2-y^2},{x,-2,2},{y,-2,2},RegionFunction->Function[{x,y,z},x^2+y^2<=4]]"]
Out[2]= {Token[Token`Symbol,Plot3D,<|Source->{{1,1},{1,6}}|>],
Token[Token`Operator`OpenSquare,[,<|Source->{{1,7},{1,7}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,8},{1,8}}|>],
Token[Token`Symbol,x,<|Source->{{1,9},{1,9}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,10},{1,10}}|>],
Token[Token`Number,2,<|Source->{{1,11},{1,11}}|>],
Token[Token`Operator`Plus,+,<|Source->{{1,12},{1,12}}|>],
Token[Token`Symbol,y,<|Source->{{1,13},{1,13}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,14},{1,14}}|>],
Token[Token`Number,2,<|Source->{{1,15},{1,15}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,16},{1,16}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,17},{1,17}}|>],
Token[Token`Symbol,x,<|Source->{{1,18},{1,18}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,19},{1,19}}|>],
Token[Token`Number,2,<|Source->{{1,20},{1,20}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,21},{1,21}}|>],
Token[Token`Symbol,y,<|Source->{{1,22},{1,22}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,23},{1,23}}|>],
Token[Token`Number,2,<|Source->{{1,24},{1,24}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,25},{1,25}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,26},{1,26}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,27},{1,27}}|>],
Token[Token`Symbol,x,<|Source->{{1,28},{1,28}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,29},{1,29}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,30},{1,30}}|>],
Token[Token`Number,2,<|Source->{{1,31},{1,31}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,32},{1,32}}|>],
Token[Token`Number,2,<|Source->{{1,33},{1,33}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,34},{1,34}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,35},{1,35}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,36},{1,36}}|>],
Token[Token`Symbol,y,<|Source->{{1,37},{1,37}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,38},{1,38}}|>],
Token[Token`Operator`Minus,-,<|Source->{{1,39},{1,39}}|>],
Token[Token`Number,2,<|Source->{{1,40},{1,40}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,41},{1,41}}|>],
Token[Token`Number,2,<|Source->{{1,42},{1,42}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,43},{1,43}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,44},{1,44}}|>],
Token[Token`Symbol,RegionFunction,<|Source->{{1,45},{1,58}}|>],
Token[Token`Operator`MinusGreater,->,<|Source->{{1,59},{1,60}}|>],
Token[Token`Symbol,Function,<|Source->{{1,61},{1,68}}|>],
Token[Token`Operator`OpenSquare,[,<|Source->{{1,69},{1,69}}|>],
Token[Token`Operator`OpenCurly,{,<|Source->{{1,70},{1,70}}|>],
Token[Token`Symbol,x,<|Source->{{1,71},{1,71}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,72},{1,72}}|>],
Token[Token`Symbol,y,<|Source->{{1,73},{1,73}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,74},{1,74}}|>],
Token[Token`Symbol,z,<|Source->{{1,75},{1,75}}|>],
Token[Token`Operator`CloseCurly,},<|Source->{{1,76},{1,76}}|>],
Token[Token`Operator`Comma,,,<|Source->{{1,77},{1,77}}|>],
Token[Token`Symbol,x,<|Source->{{1,78},{1,78}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,79},{1,79}}|>],
Token[Token`Number,2,<|Source->{{1,80},{1,80}}|>],
Token[Token`Operator`Plus,+,<|Source->{{1,81},{1,81}}|>],
Token[Token`Symbol,y,<|Source->{{1,82},{1,82}}|>],
Token[Token`Operator`Caret,^,<|Source->{{1,83},{1,83}}|>],
Token[Token`Number,2,<|Source->{{1,84},{1,84}}|>],
Token[Token`Operator`LessEqual,<=,<|Source->{{1,85},{1,86}}|>],
Token[Token`Number,4,<|Source->{{1,87},{1,87}}|>],
Token[Token`Operator`CloseSquare,],<|Source->{{1,88},{1,88}}|>],
Token[Token`Operator`CloseSquare,],<|Source->{{1,89},{1,89}}|>],
Token[Token`EOF,,<|Source->{{2,0},{2,0}}|>]}

The AST paclet is under development and the format of the output may change, but hopefully this can help.
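If only the token text is needed, as in the plain string lists produced by the other approaches, it can be pulled out of the Token[...] expressions with a pattern. This is a minimal sketch, assuming the second argument of each Token is a string (the output above prints it without quotes) and that the three-argument Token layout stays as shown; tokenStrings is a hypothetical helper name, not part of the AST package, and the sample list is built by hand so the snippet runs without the paclet installed.

```wolfram
(* Hand-built sample in the same shape as the TokenizeString output above. *)
(* Assumption: the second argument of Token is a string. *)
sample = {
   Token[Token`Symbol, "x", <|Source -> {{1, 1}, {1, 1}}|>],
   Token[Token`Operator`Caret, "^", <|Source -> {{1, 2}, {1, 2}}|>],
   Token[Token`Number, "2", <|Source -> {{1, 3}, {1, 3}}|>],
   Token[Token`EOF, "", <|Source -> {{2, 0}, {2, 0}}|>]
   };

(* Hypothetical helper: keep the text of every token except the EOF marker. *)
tokenStrings[tokens_List] := 
 Cases[tokens, Token[Except[Token`EOF], s_String, _] :> s]

tokenStrings[sample]
```

With the paclet loaded, the same helper should apply directly to the result of TokenizeString, as long as the output format has not changed.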


TreeForm applied to the held expression shows its tree structure. This, with a suitable transform function to traverse the tree, would be an adequate tokenizer:

TreeForm[Hold[
  Plot3D[{x^2 + y^2, -x^2 - y^2}, {x, -2, 2}, {y, -2, 2}, 
   RegionFunction -> Function[{x, y, z}, x^2 + y^2 <= 4]]]]

[Image: TreeForm rendering of the held Plot3D expression]
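One way such a traversal could look is sketched below. It collects the atoms (heads and leaves) of a held expression and turns them into strings. leafStrings is a hypothetical name, and this is only one possible transform. Note that it walks the FullForm tree, not the source text, so -x^2 shows up as Times, -1, Power, x, 2, and brackets and commas do not appear at all.

```wolfram
(* Hypothetical helper: list the atoms of an expression as strings, *)
(* without ever evaluating the expression. *)
SetAttributes[leafStrings, HoldAllComplete]

leafStrings[expr_] := Cases[
  Unevaluated[expr],
  (* match atoms only; Unevaluated keeps symbols with values from evaluating *)
  a_ /; AtomQ[Unevaluated[a]] :> ToString[Unevaluated[a], InputForm],
  {-1},          (* level -1: the leaves of the tree *)
  Heads -> True  (* also visit heads such as Plus and Power *)
  ]

leafStrings[x^2 + y^2 <= 4]
```

Whether this counts as adequate depends on the use: it is enough to enumerate the symbols and numbers in an expression, but it cannot reproduce the original operator tokens.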
