This is the first post on a series on parsing using the Menhir parser generator for OCaml. I found the existing tutorials insufficient, so I wrote my own.
Menhir is the standard parser generator for OCaml. ocamllex is the standard lexer generator. They are built to work together. We’re using OCaml because it’s a great language for writing compilers.
Preliminaries
Install opam
, the package manager for OCaml, then install
dune
, the OCaml build system, by running:
$ opam install dune
Source Code
The code for this tutorial is in this repository, in the forth
directory.
Project Structure
The project structure is:
forth/
bin/
dune
menhir-tutorial.ml
lib/
dune
Ast.ml
Lexer.mll
Parser.mly
ParserInterface.ml
test/
dune
ParserTest.ml
dune-project
Much of this is build system boilerplate. The important files are:
bin/menhir-tutorial.ml
- The entrypoint for the interactive executable.
lib/Ast.ml
- Types to represent Forth programs.
lib/Lexer.mll
- The lexer source code.
lib/Parser.mly
- The parser source code.
lib/ParserInterface.ml
- An abstraction to simplify calling the parser.
test/ParserTest.ml
- Tests of the parser.
Forth
Forth is a simple language and parsing it is little more than lexing: a Forth program is a string of atoms, where each atom is one of:
- An integer constant, like
123
or-10
. - A floating point constant, like
3.14
or6.1e23
. - A word, which is a basically a function call, like
ADD
orPRIMEP
.
Representing Forth Programs
Menhir won’t write type definitions for us. We have to first define the types to represent Forth programs. Since Forth is very simple, the types are a one-to-one translation of the definition above.
First, an atom
is either an integer constant, or a float constant, or a word:
type atom =
| Int of int
| Float of float
| Word of string
Second, a program is a sequence of atoms
:
type program = Program of atom list
That’s it.
For interactivity, let’s add a couple of functions to turn atoms and programs into human-readable strings:
let dump_atom = function
| Int i ->
"Int " ^ string_of_int i
| Float f ->
"Float " ^ string_of_float f
| Word s ->
"Word \"" ^ s ^ "\""
let dump_program (Program atoms) =
"[" ^ (String.concat ", " (List.map dump_atom atoms)) ^ "]"
The Lexer
Lexing means turning a stream of bytes into a stream of tokens. The lexer essentially applies a series of regular expressions to the input until one matches, and returns the token associated to that regular expression. But it’s still a flat list of tokens. Later, the parser will turn that flat list of tokens into a tree.
The full source code of the lexer is:
let digit = ['0'-'9']
let sign = ['-' '+']
let alpha = ['a'-'z' 'A'-'Z']
let int_constant = sign? digit+
let exponent = ['e' 'E']
let float_constant = sign? digit+ '.' digit+ (exponent sign? digit+)?
let identifier = alpha (alpha | digit | '-')*
let whitespace = [' ' '\t']+
(* Rules *)
rule token = parse
| int_constant { INT_CONSTANT (int_of_string (Lexing.lexeme lexbuf)) }
| float_constant { FLOAT_CONSTANT (float_of_string (Lexing.lexeme lexbuf)) }
| identifier { WORD (Lexing.lexeme lexbuf) }
(* etc. *)
| whitespace { token lexbuf }
| eof { EOF }
| _ { raise (Failure ("Character not allowed in source text: '" ^ Lexing.lexeme lexbuf ^ "'")) }
The lexer is divided into two parts: let
definitions and rules. A simple lexer
like this has one rule, generally strings like string parsing are implemented as
other rules. But we’re not doing string literals in this example yet.
Regular expressions can get very complicated, so we can use let
to define
regular expression fragments. For example, digit
is a regular expression
fragment that matches any digit from 0 to 9:
let digit = ['0'-'9']
sign
matches the plus or minus characters:
let sign = ['-' '+']
And alpha
matches any letter in the alphabet, lowercase or uppercase:
let alpha = ['a'-'z' 'A'-'Z']
The regular expression for integer constants is then:
let int_constant = sign? digit+
Which is simpler, and easier to read, than its equivalent expanded form:
let int_constant = ['-' '+']? ['0'-'9']+
The regular expression for float constants is more complicated:
let exponent = ['e' 'E']
let float_constant = sign? digit+ '.' digit+ (exponent sign? digit+)?
Finally, the regex for identifiers: a letter followed by zero or more letters, digits, and dashes.
let identifier = alpha (alpha | digit | '-')*
We also need a regex to match whitespace:
let whitespace = [' ' '\t']+
Next, we define the rule for parsing a token. We have three types of tokens in the language: integer constants, floating point constants, and words. After that we have to handle three special cases.
The general structure of a rule is:
| <regex> { <expression> }
When <regex>
matches the input stream, evaluate <expression>
. When writing
rules, the expression Lexing.lexeme lexbuf
extracts the text matched by the
regex on the left hand side of the rule.
The rules for parsing atoms are simple:
| int_constant { INT_CONSTANT (int_of_string (Lexing.lexeme lexbuf)) }
| float_constant { FLOAT_CONSTANT (float_of_string (Lexing.lexeme lexbuf)) }
| identifier { WORD (Lexing.lexeme lexbuf) }
Translated:
-
If the
int_constant
regex matches the input, take the string matched by the regex, convert the string to an integer, and return theINT_CONSTANT
token with said integer as the argument.Here,
INT_CONSTANT
is a function that takes an int and returns a token. This is defined, in a somewhat strange-looping manner, by the parser, which complicates the presentation. Just hold on and we’ll get to it in the next section. -
If the
float_constant
regex matches the input, take the matched string, convert it to a float, and return aFLOAT_CONSTANT
token with the float value. The type ofFLOAT_CONSTANT
isfloat -> token
. Again, see below. -
If the
identifier
regex is satisfied, return aWORD
token with just the matched text.
The last two rules are universal: eof
produces the EOF
token (we need this
for the parser, the start symbol needs to end with an EOF to signal to the
parser it need not consume anything else), and any character not matched by any
regular expression produces a lexer error:
| eof { EOF }
| _ { raise (Failure ("Character not allowed in source text: '" ^ Lexing.lexeme lexbuf ^ "'")) }
The Parser
Parsers are defined in .mly
files. A parser has two sections, separated by a
line with two percent signs:
<declarations>
%%
<rules>
The declarations
section allows you to define tokens and the start symbols of
the grammar, the start symbol being the rule where the parser begins. You can
have multiple start symbols, this can be useful to test different parts of the
parser in isolation.
First, we have to import the Ast
module so we can access the constructors of
the program
and atom
types to build up our abstract syntax tree.
%{
open Ast
%}
The percentage sign and curly brace are for embedding OCaml code in the parser. This can be, for example, to define functions that will be used in the rules. But our parser is very simple so we don’t need anything other than an import.
Then, we define our tokens. Tokens that carry no data (like language keywords,
symbols like braces, or the EOF
token) are defined with %token [name]
.
We define the EOF
token as follows:
%token EOF
Tokens can also carry information. For example, a string constant token would
carry the text of the string. We define this with the syntax %token <[type]>
[name]
. Note that the angle brackets are literal here.
We define the INT_CONSTANT
, FLOAT_CONSTANT
, and WORD
tokens we used
earlier in the parser:
%token <int> INT_CONSTANT
%token <float> FLOAT_CONSTANT
%token <string> WORD
Finally, we specify the start rule, which will be called program
:
%type <Ast.program> program
%start program
Menhir needs to know the type of all start symbols. Additionally, the type has
to be a fully qualified name, even though we’ve imported the Ast
module
earlier.
Now we move on to the rules.
The rule for parsing a program is very simple: a program is a non-empty sequence
of atoms, followed by EOF
:
%%
program:
| atom* EOF { Program $1 }
;
atom*
produces a potentially empty list of atom
instances.
The EOF
token is very important: all start symbols have to end with
EOF
. Otherwise, the parser will error because it expects nothing, but
encounters an EOF
token.
The rule for parsing atoms is a one-to-one match with the definition of atoms above: an atom is either an integer constant, a float constant, or a word.
atom:
| i=INT_CONSTANT { Int i }
| f=FLOAT_CONSTANT { Float f }
| s=WORD { Word s }
;
In the first case, the expression i
is of type int
, since the INT_CONSTANT
token carries an int
value. Analogously, in the second case the type of f
is
float
since FLOAT_CONSTANT
carries a float
, and in the third case the type
of s
is string
since the WORD
token carries a string
.
Note how the grammar definition matches the inductive definition of the types:
type program = Program of atom list
type atom =
| Int of int
| Float of float
| Word of string
The full contents of lib/Parser.mly
:
%{
open Ast
%}
%token EOF
%token <int> INT_CONSTANT
%token <float> FLOAT_CONSTANT
%token <string> WORD
/* Types */
%type <Ast.program> program
%start program
%%
program:
| atom* EOF { Program $1 }
;
atom:
| i=INT_CONSTANT { Int i }
| f=FLOAT_CONSTANT { Float f }
| s=WORD { Word s }
;
Parser Interface
This is just a bit of boilerplate: we create a function parse_program: string
-> program
. Parse errors throw the Failure
exception with line and column
information in the error message.
open Lexing
let colnum pos =
(pos.pos_cnum - pos.pos_bol) - 1
let pos_string pos =
let l = string_of_int pos.pos_lnum
and c = string_of_int ((colnum pos) + 1) in
"line " ^ l ^ ", column " ^ c
let parse' f s =
let lexbuf = Lexing.from_string s in
try
f Lexer.token lexbuf
with Parser.Error ->
raise (Failure ("Parse error at " ^ (pos_string lexbuf.lex_curr_p)))
let parse_program s =
parse' Parser.program s
Unit Tests
We use the OUnit2 unit test library.
First, we import the unit test library, and we import Ast
to have access to the constructors for the program
and atom
types, and we import ParserInterface
to have the parse_program
function.
open OUnit2
open Menhir_tutorial_core.Ast
open Menhir_tutorial_core.ParserInterface
peq
(for “parse equals”) is a brief utility function to check that the given
string parses to the given value:
let peq (s: string) (v: 'a) =
assert_equal v (parse_program s)
Unit tests of parsing integer constants:
let test_parse_int_constants _ =
peq "0" (Program [Int 0]);
peq "123" (Program [Int 123]);
peq "+123" (Program [Int 123]);
peq "1000" (Program [Int 1000]);
peq "-0" (Program [Int 0]);
peq "-123" (Program [Int (-123)])
Unit tests of parsing float constants:
let test_parse_float_constants _ =
peq "0.0" (Program [Float 0.0]);
peq "123.0" (Program [Float 123.0]);
peq "+123.0" (Program [Float 123.0]);
peq "1000.0" (Program [Float 1000.0]);
peq "-123.0" (Program [Float (-123.0)]);
peq "123.0e6" (Program [Float 123.0e6]);
peq "123.0e-6" (Program [Float 123.0e-6]);
peq "-123.0e6" (Program [Float (-123.0e6)]);
peq "-123.0e-6" (Program [Float (-123.0e-6)])
Unit tests of parsing words:
let test_parse_words _ =
peq "TEST" (Program [Word "TEST"]);
peq "TEST-PROGRAM" (Program [Word "TEST-PROGRAM"]);
peq "TEST-123" (Program [Word "TEST-123"])
Unit tests of parsing atom sequences:
let test_parse_programs _ =
peq "TEST 3.14 123 -3.4" (Program [Word "TEST"; Float 3.14; Int 123; Float (-3.4)]);
peq "A 1 B" (Program [Word "A"; Int 1; Word "B"])
Finally, we put it all together into a test suite and run it:
let suite =
"Parser tests" >::: [
"Integer constants" >:: test_parse_int_constants;
"Float constants" >:: test_parse_float_constants;
"Words" >:: test_parse_words;
"Programs" >:: test_parse_programs
]
let _ = run_test_tt_main suite
The full contents of test/ParserTest.ml
file:
open OUnit2
open Menhir_tutorial_core.Ast
open Menhir_tutorial_core.ParserInterface
let peq (s: string) (v: 'a) =
assert_equal v (parse_program s)
let test_parse_int_constants _ =
peq "0" (Program [Int 0]);
peq "123" (Program [Int 123]);
peq "+123" (Program [Int 123]);
peq "1000" (Program [Int 1000]);
peq "-0" (Program [Int 0]);
peq "-123" (Program [Int (-123)])
let test_parse_float_constants _ =
peq "0.0" (Program [Float 0.0]);
peq "123.0" (Program [Float 123.0]);
peq "+123.0" (Program [Float 123.0]);
peq "1000.0" (Program [Float 1000.0]);
peq "-123.0" (Program [Float (-123.0)]);
peq "123.0e6" (Program [Float 123.0e6]);
peq "123.0e-6" (Program [Float 123.0e-6]);
peq "-123.0e6" (Program [Float (-123.0e6)]);
peq "-123.0e-6" (Program [Float (-123.0e-6)])
let test_parse_words _ =
peq "TEST" (Program [Word "TEST"]);
peq "TEST-PROGRAM" (Program [Word "TEST-PROGRAM"]);
peq "TEST-123" (Program [Word "TEST-123"])
let test_parse_programs _ =
peq "TEST 3.14 123 -3.4" (Program [Word "TEST"; Float 3.14; Int 123; Float (-3.4)]);
peq "A 1 B" (Program [Word "A"; Int 1; Word "B"])
let suite =
"Parser tests" >::: [
"Integer constants" >:: test_parse_int_constants;
"Float constants" >:: test_parse_float_constants;
"Words" >:: test_parse_words;
"Programs" >:: test_parse_programs
]
let _ = run_test_tt_main suite
Bonus: An RPN Calculator
I would be remiss in my duties if I didn’t do this:
(* bin/menhir_tutorial.ml *)
open Menhir_tutorial_core.Ast
open Menhir_tutorial_core.ParserInterface
let pop = function
| [] -> raise (Failure "Stack is empty")
| first::rest -> (first, rest)
let push v s = v :: s
let rec eval_program program stack =
match program with
| [] ->
stack
| head::rest ->
let stack = (match head with
(* Constants are self-evaluating *)
| Int i -> push (Int i) stack
| Float i -> push (Float i) stack
| Word s ->
(match s with
| "ADD" -> oper (+.) stack
| "SUB" -> oper (-.) stack
| "MUL" -> oper ( *. ) stack
| "DIV" -> oper (/.) stack
| _ -> raise (Failure ("Unknown word: " ^ s))))
in
eval_program rest stack
and oper f stack =
let (rhs, stack) = pop stack in
let (lhs, stack) = pop stack in
let lhs = as_float lhs
and rhs = as_float rhs in
let res = Float (f lhs rhs) in
push res stack
and as_float = function
| Int i -> float_of_int i
| Float f -> f
| Word _ -> raise (Failure "Not a number.")
let rec repl _ =
print_string "> ";
let input = read_line () in
let (Program program) = parse_program input in
try
let result = eval_program program [] in
let output = dump_program (Program result) in
print_endline output;
print_endline "";
repl ()
with (Failure msg) ->
print_endline msg;
repl()
let _ = repl ()
Building and Running
You can clone this example, build it, and run the interpreter by running:
$ git clone git@github.com:eudoxia0/parsing-menhir.git
$ cd parsing-menhir/forth
$ dune build
$ ./_build/default/forth/bin/menhir_tutorial.exe
> 123
[Int 123]
> 3.14
[Float 3.14]
> 3.0 1.0 ADD
[Float 4.]
> 10 30 MUL
[Float 300.]
> 100 5 DIV
[Float 20.]