How would you go about parsing Markdown?

The only markdown implementation I know of, that uses an actual parser, is Jon MacFarleane’s peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg.


EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog Weblog Engine. Because the parser is written in OCaml, it is extremely simple and short (268 SLOC for the parser, 43 SLOC for the HTML emitter), yet blazingly fast (20% faster than discount (written in hand-optimized C) and sixhundred times faster than BlueCloth (Ruby)), despite the fact that it isn't even optimized for performance yet. Because it is only intended for internal use by Mauricio himself for his weblog, there are a few deviations from the official Markdown specification, but Mauricio has created a branch which reverts most of those changes.


If I was to try to parse markdown (and its extension Markdown extra) I think I would try to use a state machine and parse it one char at a time, linking together some internal structures representing bits of text as I go along then, once all is parsed, generating the output from the objects all stringed together.

Basically, I'd build a mini-DOM-like tree as I read the input file.
To generate an output, I would just traverse the tree and output HTML or anything else (PS, LaTex, RTF,...)

Things that can increase complexity:

  • The fact that you can mix HTML and markdown, although the rule could be easy to implement: just ignore anything that's between two balanced tags and output it verbatim.

  • URLs and notes can have their reference at the bottom of the text. Using data structures for hyperlinks could simply record something like:

    [my text to a link][linkkey]
    results in a structure like: 
        URLStructure: 
        |  InnerText : "my text to a link"
        |  Key       : "linkkey"
        |  URL       : <null>
    
  • Headers can be defined with an underline, that could force us to use a simple data structure for a generic paragraph and modify its properties as we read the file:

    ParagraphStructure:
    |  InnerText    : the current paragraph text 
    |                 (beginning of line until end of line).
    |  HeadingLevel : <null> or 1-4 when we can assess 
    |                 that paragraph heading level, if any.
    

Anyway, just some thoughts.

I'm sure that there are many small details to take care of and I'm pretty sure that Regexes could become handy during the process.
After all, they were meant to process text.


I released a new parser-based Markdown Java implementation last week, called pegdown. pegdown uses a PEG parser to first build an abstract syntax tree, which is subsequently written out to HTML. As such it is quite clean and much easier to read, maintain and extend than a regex based approach. The PEG grammar is based on John MacFarlanes C implementation "peg-markdown".

Maybe something of interest to you...