Standalone lexers with lex: synopsis, examples, and pitfalls
[shared via Google Reader from Matt Might’s blog]
Lexical analysis is the first phase of compilation.
During this phase, source code received character-by-character is transformed into a sequence of “tokens.”
For example, for the following Python expression:
print (3 + x
*2 ) # comment
the resulting stream of tokens might be (encoded as S-Expressions):
(keyword "print")
(delim "(")
(int 3)
(punct "+")
(id "x")
(punct "*")
(int 2)
(delim ")")
Lexical analysis strips away insignificant whitespace and comments, and it groups the remaining characters into individual tokens.
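To make the grouping concrete, here is a minimal hand-written sketch in C of the kind of scanner that lex would otherwise generate. It is not lex output, and it is not how the worked examples are built. The names tokenize and emit, the one-word keyword list (just print), the set of punctuation characters, and the error token for unrecognized input are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Append one token to out as an S-expression line, e.g. (int 3) or (id "x").
   `quoted` selects whether the lexeme is wrapped in double quotes.
   (Hypothetical helper, not part of lex.) */
static void emit(char *out, size_t cap, const char *kind,
                 const char *lexeme, size_t len, int quoted)
{
    size_t used = strlen(out);
    if (used >= cap)
        return;
    if (quoted)
        snprintf(out + used, cap - used, "(%s \"%.*s\")\n", kind, (int)len, lexeme);
    else
        snprintf(out + used, cap - used, "(%s %.*s)\n", kind, (int)len, lexeme);
}

/* Scan src and write one token per line into out (at most cap bytes,
   always NUL-terminated). Whitespace and '#' comments produce no tokens. */
void tokenize(const char *src, char *out, size_t cap)
{
    const char *p = src;

    assert(src != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {
            p++;                                    /* insignificant whitespace */
        } else if (*p == '#') {
            while (*p != '\0' && *p != '\n')        /* comment runs to end of line */
                p++;
        } else if (isdigit((unsigned char)*p)) {
            const char *start = p;                  /* integer literal */
            while (isdigit((unsigned char)*p))
                p++;
            emit(out, cap, "int", start, (size_t)(p - start), 0);
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                  /* keyword or identifier */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            size_t len = (size_t)(p - start);
            /* Assumption: "print" is the only keyword in this sketch. */
            int is_keyword = (len == 5 && strncmp(start, "print", 5) == 0);
            emit(out, cap, is_keyword ? "keyword" : "id", start, len, 1);
        } else if (*p == '(' || *p == ')') {
            emit(out, cap, "delim", p, 1, 1);
            p++;
        } else if (strchr("+-*/", *p) != NULL) {
            emit(out, cap, "punct", p, 1, 1);
            p++;
        } else {
            emit(out, cap, "error", p, 1, 1);       /* assumption: flag unknown input */
            p++;
        }
    }
}
```

Given the expression above and a large enough buffer, tokenize fills the buffer with the eight tokens listed earlier, one per line. Each branch of the loop corresponds to one pattern; writing those branches by hand for a larger language gets tedious quickly, which is the motivation for generating the scanner from a lex specification instead.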
Using the Unix tool lex, it’s possible to create “standalone” lexers.
In a compiler/interpreter toolchain where each phase is a standalone program, a shell pipeline plugs the phases together, e.g.:
tokenize < input | parse | interpret
Read on for an introduction to lex with synopses, examples and pitfalls.
Worked examples include:
- a comment-density calculator for C;
- a desugarer for Python-like significant whitespace; and
- a standalone lexer for the obligatory calc language.