I have often seen developers ask questions like “How do I parse HTML with a regular expression?” The correct response is: you don’t. A regex can’t do that; what you need is a context-free grammar. Context-free grammars describe context-free languages, which include many programming languages and DSLs these days.

A context-free grammar contains a set of replacement rules. If you wanted to create a context-free grammar for basic arithmetic, you might end up with something like this:

EXPRESSION -> number
EXPRESSION -> ( EXPRESSION )
EXPRESSION -> EXPRESSION + EXPRESSION
EXPRESSION -> EXPRESSION - EXPRESSION
EXPRESSION -> EXPRESSION * EXPRESSION
EXPRESSION -> EXPRESSION / EXPRESSION

This is a recursive definition that says an expression can be a number, another expression in parentheses, or two expressions separated by one of the four operators you would expect from basic math.
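To see the grammar in action, here is a minimal hand-rolled sketch of a parser for it. Note that the grammar as written is ambiguous (it does not say whether `*` binds tighter than `+`), so this sketch restructures it into levels, a common trick that encodes precedence directly in the rules. All names here are invented for illustration, not taken from any library.

```javascript
// Hand-rolled recursive-descent evaluator for the arithmetic grammar above,
// restructured into precedence levels so * and / bind tighter than + and -.
function evaluate(input) {
  const tokens = input.match(/\d+(?:\.\d+)?|[()+\-*/]/g); // crude tokenizer
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  // EXPRESSION -> TERM (("+" | "-") TERM)*
  function expression() {
    let value = term();
    while (peek() === "+" || peek() === "-") {
      value = next() === "+" ? value + term() : value - term();
    }
    return value;
  }

  // TERM -> FACTOR (("*" | "/") FACTOR)*
  function term() {
    let value = factor();
    while (peek() === "*" || peek() === "/") {
      value = next() === "*" ? value * factor() : value / factor();
    }
    return value;
  }

  // FACTOR -> number | "(" EXPRESSION ")"
  function factor() {
    if (peek() === "(") {
      next();                   // consume "("
      const value = expression();
      next();                   // consume ")"
      return value;
    }
    return Number(next());
  }

  return expression();
}
```

For example, `evaluate("1+2*3")` gives 7 rather than 9, because the TERM level is evaluated before the surrounding EXPRESSION level.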

Of course, having this formalism is not really helpful unless you can translate it into code that you can easily use in your project.

In order to do that you need two tools: a lexer and a parser. The original versions of these tools were called “Lex” and “Yacc”. Lex takes a series of regular expressions and uses them to turn your input into a sequence of tokens; Yacc (“Yet Another Compiler Compiler”) then fits those tokens into your grammar.
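To make the lexer half concrete, here is a sketch of what a Lex-style tokenizer does: try a list of regular expressions against the front of the input and emit named tokens. The function and token names here are invented for the example, not part of any real tool.

```javascript
// Illustrative Lex-style tokenizer: each rule pairs a regex (anchored to the
// start of the remaining input) with a token type; null means "discard".
const rules = [
  [/^\s+/, null],               // whitespace: matched but thrown away
  [/^\d+(?:\.\d+)?/, "NUMBER"],
  [/^[+\-*/]/, "OPERATOR"],
  [/^[()]/, "PAREN"],
];

function tokenize(input) {
  const tokens = [];
  while (input.length > 0) {
    const rule = rules.find(([re]) => re.test(input));
    if (!rule) throw new Error(`Unexpected input: ${input}`);
    const [re, type] = rule;
    const text = input.match(re)[0];
    if (type !== null) tokens.push({ type, text }); // skip discarded matches
    input = input.slice(text.length);
  }
  return tokens;
}
```

Running `tokenize("3 + 4")` yields a NUMBER, an OPERATOR, and another NUMBER, with the whitespace silently dropped; this token stream is what the parser consumes.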

Those original tools were written in C and were later reimplemented as the free-software tools “Flex” and GNU “Bison”, which also produce C code. In addition, versions of these tools have been ported to a large number of other languages. In some cases, like Erlang, they are part of the language’s standard library, while for JavaScript or Python they can be found in the usual package repositories.

Here are some versions of Lex and Yacc for various languages; if your language of choice is not listed here, Google will probably turn something up.

Language     Lexer   Parser
C            Flex    Bison
C            Lex     Yacc
JavaScript   Jison   Jison
Python       PLY     PLY
Erlang       Leex    Yecc
Elixir       Leex    Yecc

The nice thing about this is that, since the original Lex and Yacc were written back in the 1970s and form the basis of many compilers, the structure of these tools is very well understood.

Here is an example taken from the Jison project that implements a basic calculator as described above, but with a few more rules.

In JavaScript, the Jison project combines both the lexer and the parser into one file; other tools may do it differently. The file starts with a lexer that matches a series of regular expressions and says what to return depending on what is found: it can match numbers, symbols like “%” or “+”, and even the constant “PI”.
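For readers who cannot see the example file, a Jison grammar file generally has this shape. This is a condensed sketch of the format, not the exact calculator example from the project: the lexer rules sit between `%lex` and `/lex`, and the grammar rules follow the `%%` marker.

```
/* lexical grammar */
%lex
%%
\s+                   /* skip whitespace */
[0-9]+("."[0-9]+)?\b  return 'NUMBER';
"*"                   return '*';
"+"                   return '+';
<<EOF>>               return 'EOF';
/lex

/* operator precedence: lowest first */
%left '+' '-'
%left '*' '/'

%start expressions

%% /* language grammar */

expressions
    : e EOF
        {return $1;}
    ;

e
    : e '+' e
        {$$ = $1 + $3;}
    | e '*' e
        {$$ = $1 * $3;}
    | NUMBER
        {$$ = Number(yytext);}
    ;
```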

The second section (starting on line 26) expresses operator precedence, which is how the parser knows that multiplication should be done before addition and such.

The final section, starting on line 40, defines what expressions the language can express. The symbol “$$” is the return value of a rule, and “$1”, “$2”, “$3”, etc. are the values of the first, second, and third symbols matched by that rule.

The nice thing about this code is how declarative it is. The rule for addition looks like this: `e '+' e {$$ = $1 + $3;}`, which can be read as “the result is the value of the first sub-expression plus the value of the third”.
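In plain JavaScript terms, the semantic action for that rule behaves roughly like the function below. This is an illustration of the idea, with an invented name; the generated parser does this internally when it reduces the rule.

```javascript
// Roughly what happens when the parser reduces  e : e '+' e {$$ = $1 + $3;}
// The values array holds the parsed values of the rule's three symbols.
function reduceAddition(values) {
  const $1 = values[0]; // value of the left sub-expression
  const $3 = values[2]; // value of the right one ($2 is the '+' token itself)
  return $1 + $3;       // this result becomes $$, the value of the whole rule
}
```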

Best of all, while the actual code to run the parser is rather complex, you don’t have to write it; the tool generates it for you. Of course, the exact way you integrate this into a larger project depends a bit on which tool and language you are using.

But now you know how to make a calculator in JavaScript without using eval().

If you want to know more about lexers and parsers, find a book on compiler construction; there are also a large number of resources on the web that can help you.