Equation (expression) parser with precedence?

The hard way

You want a recursive descent parser.

To get precedence you need to think recursively, for example, using your sample string,

1+11*5

to do this manually, you would have to read the 1, then see the plus and start a whole new recursive parse "session" starting with 11... and make sure to parse the 11 * 5 into its own factor, yielding a parse tree with 1 + (11 * 5).

This all feels so painful even to attempt to explain, especially with the added powerlessness of C. See, after parsing the 11, if the * was actually a + instead, you would have to abandon the attempt at making a term and instead parse the 11 itself as a factor. My head is already exploding. It's possible with the recursive decent strategy, but there is a better way...

The easy (right) way

If you use a GPL tool like Bison, you probably don't need to worry about licensing issues since the C code generated by bison is not covered by the GPL (IANAL but I'm pretty sure GPL tools don't force the GPL on generated code/binaries; for example Apple compiles code like say, Aperture with GCC and they sell it without having to GPL said code).

Download Bison (or something equivalent, ANTLR, etc.).

There is usually some sample code that you can just run bison on and get your desired C code that demonstrates this four function calculator:

http://www.gnu.org/software/bison/manual/html_node/Infix-Calc.html

Look at the generated code, and see that this is not as easy as it sounds. Also, the advantages of using a tool like Bison are 1) you learn something (especially if you read the Dragon book and learn about grammars), 2) you avoid NIH trying to reinvent the wheel. With a real parser-generator tool, you actually have a hope at scaling up later, showing other people you know that parsers are the domain of parsing tools.


Update:

People here have offered much sound advice. My only warning against skipping the parsing tools or just using the Shunting Yard algorithm or a hand rolled recursive decent parser is that little toy languages1 may someday turn into big actual languages with functions (sin, cos, log) and variables, conditions and for loops.

Flex/Bison may very well be overkill for a small, simple interpreter, but a one off parser+evaluator may cause trouble down the line when changes need to be made or features need to be added. Your situation will vary and you will need to use your judgement; just don't punish other people for your sins [2] and build a less than adequate tool.

My favorite tool for parsing

The best tool in the world for the job is the Parsec library (for recursive decent parsers) which comes with the programming language Haskell. It looks a lot like BNF, or like some specialized tool or domain specific language for parsing (sample code [3]), but it is in fact just a regular library in Haskell, meaning that it compiles in the same build step as the rest of your Haskell code, and you can write arbitrary Haskell code and call that within your parser, and you can mix and match other libraries all in the same code. (Embedding a parsing language like this in a language other than Haskell results in loads of syntactic cruft, by the way. I did this in C# and it works quite well but it is not so pretty and succinct.)

Notes:

1 Richard Stallman says, in Why you should not use Tcl

The principal lesson of Emacs is that a language for extensions should not be a mere "extension language". It should be a real programming language, designed for writing and maintaining substantial programs. Because people will want to do that!

[2] Yes, I am forever scarred from using that "language".

Also note that when I submitted this entry, the preview was correct, but SO's less than adequate parser ate my close anchor tag on the first paragraph, proving that parsers are not something to be trifled with because if you use regexes and one off hacks you will probably get something subtle and small wrong.

[3] Snippet of a Haskell parser using Parsec: a four function calculator extended with exponents, parentheses, whitespace for multiplication, and constants (like pi and e).

aexpr   =   expr `chainl1` toOp
expr    =   optChainl1 term addop (toScalar 0)
term    =   factor `chainl1` mulop
factor  =   sexpr  `chainr1` powop
sexpr   =   parens aexpr
        <|> scalar
        <|> ident

powop   =   sym "^" >>= return . (B Pow)
        <|> sym "^-" >>= return . (\x y -> B Pow x (B Sub (toScalar 0) y))

toOp    =   sym "->" >>= return . (B To)

mulop   =   sym "*" >>= return . (B Mul)
        <|> sym "/" >>= return . (B Div)
        <|> sym "%" >>= return . (B Mod)
        <|>             return . (B Mul)

addop   =   sym "+" >>= return . (B Add) 
        <|> sym "-" >>= return . (B Sub)

scalar = number >>= return . toScalar

ident  = literal >>= return . Lit

parens p = do
             lparen
             result <- p
             rparen
             return result

The shunting yard algorithm is the right tool for this. Wikipedia is really confusing about this, but basically the algorithm works like this:

Say, you want to evaluate 1 + 2 * 3 + 4. Intuitively, you "know" you have to do the 2 * 3 first, but how do you get this result? The key is to realize that when you're scanning the string from left to right, you will evaluate an operator when the operator that follows it has a lower (or equal to) precedence. In the context of the example, here's what you want to do:

  1. Look at: 1 + 2, don't do anything.
  2. Now look at 1 + 2 * 3, still don't do anything.
  3. Now look at 1 + 2 * 3 + 4, now you know that 2 * 3 has to to be evaluated because the next operator has lower precedence.

How do you implement this?

You want to have two stacks, one for numbers, and another for operators. You push numbers onto the stack all the time. You compare each new operator with the one at the top of the stack, if the one on top of the stack has higher priority, you pop it off the operator stack, pop the operands off the number stack, apply the operator and push the result onto the number stack. Now you repeat the comparison with the top of stack operator.

Coming back to the example, it works like this:

N = [ ] Ops = [ ]

  • Read 1. N = [1], Ops = [ ]
  • Read +. N = [1], Ops = [+]
  • Read 2. N = [1 2], Ops = [+]
  • Read *. N = [1 2], Ops = [+ *]
  • Read 3. N = [1 2 3], Ops = [+ *]
  • Read +. N = [1 2 3], Ops = [+ *]
    • Pop 3, 2 and execute 2*3, and push result onto N. N = [1 6], Ops = [+]
    • + is left associative, so you want to pop 1, 6 off as well and execute the +. N = [7], Ops = [].
    • Finally push the [+] onto the operator stack. N = [7], Ops = [+].
  • Read 4. N = [7 4]. Ops = [+].
  • You're run out off input, so you want to empty the stacks now. Upon which you will get the result 11.

There, that's not so difficult, is it? And it makes no invocations to any grammars or parser generators.


http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm

Very good explanation of different approaches:

  • Recursive-descent recognition
  • The shunting yard algorithm
  • The classic solution
  • Precedence climbing

Written in simple language and pseudo-code.

I like 'precedence climbing' one.