Writing a syntax highlighter

Building a syntax highlighter comes down to finding specific tokens in the code (keywords, in the simplest case) and giving them a specific style (font, font style, colour etc.). To achieve this, you need to define a list of keywords specific to the programming language in which the code is written, then scan the text (e.g. using regular expressions), find those tokens and wrap them in properly styled HTML tags.

A very basic highlighter written in JavaScript would look like this:

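// Assumes `code` already holds the HTML-escaped source text to highlight.
var keywords = [ "public", "class", "private", "static", "return", "void" ];
for (var i = 0; i < keywords.length; i++)
{
        // \b matches whole words only (also at the start and end of the text);
        // the lookahead skips text inside tags or inside spans added by earlier passes.
        var regex = new RegExp("\\b(" + keywords[i] + ")\\b(?![^<]*>|[^<>]*</)", "g");
        code = code.replace(regex, "<span class='rm-code-keyword'>$1</span>");
}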
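
The loop above takes for granted that `code` exists and contains no raw `<` or `>` of its own, since the lookahead treats those characters as tag delimiters. A minimal sketch of one way to package it, assuming hypothetical helper names `escapeHtml` and `highlightKeywords` (they are not part of the original snippet or of any library), escapes the source first and then runs the keyword passes:

```javascript
// Hypothetical helpers for illustration only.
function escapeHtml(text) {
    // Escape & first so the entities produced below are not escaped again.
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

function highlightKeywords(source, keywords) {
    var code = escapeHtml(source);
    for (var i = 0; i < keywords.length; i++) {
        // Whole-word match that skips text inside tags or inside spans added earlier.
        var regex = new RegExp("\\b(" + keywords[i] + ")\\b(?![^<]*>|[^<>]*</)", "g");
        code = code.replace(regex, "<span class='rm-code-keyword'>$1</span>");
    }
    return code;
}

var html = highlightKeywords("public static void main() { return; }",
                             [ "public", "class", "private", "static", "return", "void" ]);
```

Note that this still highlights keywords inside string literals and comments, which is the main limitation of the keyword-list approach.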

Syntax highlighters can work in two very general ways. The first implements a full lexer and parser for the language(s) being highlighted, exactly identifying each token's type (keyword, class name, instance name, variable type, preprocessor directive...). This provides all the information needed to exactly highlight the code according to some specification (keywords in red, class names in blue, what have you).
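
To give a rough idea of the first approach, here is a minimal sketch of the lexing half only, for a small C-like subset. The token types, the `tok-` class names and the function names `tokenize` and `render` are assumptions made for illustration. A real implementation would add a parser on top to tell class names from variable names and so on.

```javascript
// Illustrative keyword set; a real lexer would use the language's full list.
var KEYWORDS = { "public": true, "class": true, "private": true,
                 "static": true, "return": true, "void": true };

// One alternative per token type. Order matters: comments and strings come first
// so that keywords inside them are never seen as separate tokens.
var TOKEN_RE = /(\/\/[^\n]*|\/\*[\s\S]*?\*\/)|("(?:[^"\\\n]|\\.)*")|(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(\s+)|([\s\S])/g;

function tokenize(source) {
    var tokens = [];
    var m;
    TOKEN_RE.lastIndex = 0;
    while ((m = TOKEN_RE.exec(source)) !== null) {
        var type;
        if (m[1] !== undefined) type = "comment";
        else if (m[2] !== undefined) type = "string";
        else if (m[3] !== undefined) type = "number";
        else if (m[4] !== undefined) {
            type = Object.prototype.hasOwnProperty.call(KEYWORDS, m[4]) ? "keyword" : "identifier";
        }
        else if (m[5] !== undefined) type = "whitespace";
        else type = "punctuation"; // any other single character
        tokens.push({ type: type, text: m[0] });
    }
    return tokens;
}

function escapeHtml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function render(tokens) {
    return tokens.map(function (t) {
        var text = escapeHtml(t.text);
        // Only some token types get a style in this sketch.
        if (t.type === "whitespace" || t.type === "punctuation" || t.type === "identifier") {
            return text;
        }
        return "<span class='tok-" + t.type + "'>" + text + "</span>";
    }).join("");
}

var highlighted = render(tokenize("return 42; // done"));
```

Because every character ends up in exactly one token, a keyword that appears inside a string or a comment is styled as part of that string or comment rather than as a keyword.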

The second way is something like the one Google Code Prettify employs, where instead of implementing one lexer/parser per language, a couple of very general parsers are used that can do a decent job on most syntaxes. Prettify, for example, can parse and highlight any C-like language reasonably well, because its lexer/parser can identify the general components of those kinds of languages.

This also has the advantage that you don't need to explicitly specify the language, as the engine will determine by itself which of its generic parsers can do the best job. The downside, of course, is that the highlighting is less accurate than when a language-specific parser is used.
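
As a very naive illustration of how an engine might pick between generic parsers without being told the language, the sketch below scores a few hypothetical "profiles" by counting how many of their characteristic patterns occur in the source. This is not how Google Code Prettify actually works; the profile names, the patterns and the `guessProfile` function are all assumptions made up for the example.

```javascript
// Hypothetical generic profiles; the names and hint patterns are illustrative only.
var PROFILES = [
    { name: "c-like",       hints: [ /\/\/[^\n]*/g, /\/\*[\s\S]*?\*\//g, /[{};]/g ] },
    { name: "hash-comment", hints: [ /^\s*#[^\n]*/gm, /:\s*$/gm, /\bdef\b|\bend\b/g ] },
    { name: "markup",       hints: [ /<\/?[A-Za-z][^>]*>/g, /&[a-z]+;/g ] }
];

function countMatches(source, regex) {
    var matches = source.match(regex);
    return matches === null ? 0 : matches.length;
}

function guessProfile(source) {
    var best = PROFILES[0];
    var bestScore = -1;
    for (var i = 0; i < PROFILES.length; i++) {
        var score = 0;
        for (var j = 0; j < PROFILES[i].hints.length; j++) {
            score += countMatches(source, PROFILES[i].hints[j]);
        }
        // Strict comparison: on a tie the earlier profile wins.
        if (score > bestScore) {
            best = PROFILES[i];
            bestScore = score;
        }
    }
    return best.name;
}

var guess = guessProfile("int main() { return 0; } // entry");
```

The chosen profile would then decide which generic lexer to run. A heuristic like this will sometimes guess wrong on short or ambiguous snippets, which is part of the trade-off described above.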


In StackOverflow podcast number 50, Steve Yegge talks a little about his project for creating a general highlighting mechanism. It is not a finished product and may be more sophisticated than what you are looking for, but there could be something of interest.


A good starting point for one approach is the Udacity course CS262. The title is about building a web browser, but really the class focuses on exactly the problems you are asking about - how to lex and parse text. In your case, you'd use that information for highlighting. I just took it and it was very good. The course is "over" now, but the videos and practice problems/homeworks are still up and available for viewing.