You're working as an assistant to a professor who is thinking about changing the CS 132 projects. CS 132 uses a parser generator that is based on LL parsers, a very mature technology, but the professor is thinking about switching to a parser generator based on parsing expression grammars (PEGs), which are less well-developed but have some promising properties. Among other things, a PEG is never ambiguous.
You'd like to test PEGs that are being proposed as test cases for CS 132 projects. One way is to test them on actual CS 132 projects, but solutions to these projects do not exist yet and anyway you'd like a second opinion in case the student projects are incorrect. So you decide to write a simple parser generator. Given the rules for a parsing expression grammar, and a start expression, your parser generator will generate a function that is a parser.
Like Homework 1, your parser generator
should support any type
of terminal and nonterminal symbols. This does not mean that
your generator should use Homework 1's definition for
the symbol
type constructor.
Instead, your code should define a
type constructor ('nonterminal, 'terminal) pe
for parsing
expressions, and a type constructor ('nonterminal, 'terminal)
pe_tree
for parse trees produced by these expressions,
with the properties defined in the next section.
('a, 'b) pe
that has one of the following forms:
Empty
T t
t
.N n
n
.Seq [e1;
...; en]
e1
, ..., en
, left to right. If n
= 0, this is equivalent to Empty
.Choose [e1;
...; en]
e1
, ..., en
. If n
= 0, this does not match anything (not even the empty string). This is prioritized choice: alternatives are tested in order, and the first successful match is used unconditionally, with later alternatives excluded. For example, Choose [T"a"; Seq [T"a";T"b"]]
is equivalent to T"a"
because the second alternative is excluded no matter what.Star e
e
. This expression is greedy, that is, it always matches as many e
s as possible, and it refuses to match fewer than the maximum possible number of instances. Like Empty
, this expression never fails.Not e
e
. If the current input matches e
, Not e
does not match anything.('a, 'b) pe_tree
that has one of the following forms:
Is_empty
Empty
.Leaf t
t
, which was matched with T t
.Node (n, T)
n
,
which expanded into the parse tree T
. This
parse tree was matched with N n
.Sequence [T1;
...; Tn]
T1
, ..., Tn
, left to right. This tree was matched with Seq [e1;
...; en]
, where each Ti
was matched with ei
.Repeat [T1;
...; Tn]
T1
, ..., Tn
, left to right. This tree was matched with Star e
, where each Ti
was matched with e
.Is_not e
Not e
.There is no special syntax for the parse tree when input is matched by Choose [e1;
...; en]
; instead, the parse tree should simply be that of the chosen alternative.
Some
constructor. When a parser fails to parse any prefix of its input, it returns
None
. When it successfully parses a prefix of its input, it returns Some
(T, s)
, where T
is the parse tree for the successfully-parsed prefix,
and s
is the unmatched suffix of the input, i.e.,
the part of the input string that was not parsed. If a parser parses all of
its input, then s
is []
.Write a parser generator parse_peg
produce exp that returns a parser for the
parsing expression grammar described by the production function
produce
and the parsing expression
exp
.
Also, write two good test cases for your parse_peg
function. These test cases should all be in the style of the test
cases given below. Your test cases should be named test1
and test2
. The first test case
should be based on the sample shell grammar shown below, except that
your grammar should also support if-commands, including the
tokens if
, then
, else
, elif
,
and fi
. The second test case should use a PEG of your
own.
As with Homework 1, your code may use
the Pervasives
and List
modules, but it should use no other modules. Your code should be free
of side
effects; for example, it should not
use memoization. Simplicity
is more important than efficiency, but your code should avoid using
unnecessary time and space when it is easy to do so.
Assess your work by writing an after-action report that summarizes why you solved the problem the way you did, other approaches that you considered and rejected (and why you rejected them), and any weaknesses in your solution in the context of its intended application. This report should be a simple ASCII plain text file that consumes a page or so (at most 100 lines and 80 columns per line, please). See Resources for oral presentations and written reports for advice on how to write assessments; admittedly much of the advice there is overkill for the simple kind of report we're looking for here.
Unlike Homework 1, we are expecting some weaknesses here, so your assessment should talk about them. For example, we don't expect that your implementation will work with all possible grammars, but we would like to know which sort of grammars it will have trouble with.
We will test your program on the SEASnet Linux servers as before,
so make sure that
/usr/local/cs/bin
is at the start of your path,
using the same technique as in Homework 1.
Submit three files via CourseWeb. The file peg.ml
should define parse_peg
along with any auxiliary types
and functions needed to define parse_peg
. The file
peg-test.ml
should contain your test cases. The file
peg.txt
should hold your assessment. Please do not put
your name, student ID, or other personally identifying information in
your files.
(* Many of the examples are so simple that there is only one nonterminal.
For these examples, we use the 0-tuple () with no arguments as the
nonterminal value. For these grammars, the unit is the start symbol.
For convenience, define the unit start symbol here. *)
let start1 = N ()
(* A grammar for what sheep say. *)
let baa_rules = fun () -> Seq [T"b"; Star (T"a")]
let test_baa1 =
(parse_peg baa_rules start1 ["b"]
= Some (Node ((), Sequence [Leaf "b"; Repeat []]), []))
let test_baa2 =
(parse_peg baa_rules start1 ["b";"a";"a";"a";"a"]
= Some (Node ((), Sequence [Leaf "b";
Repeat [Leaf "a"; Leaf "a"; Leaf "a"; Leaf "a"]]),
[]))
let test_baa2 =
(parse_peg baa_rules start1 ["b";"a";"o";"b";"a";"b"]
= Some (Node ((), Sequence [Leaf "b"; Repeat [Leaf "a"]]),
["o"; "b"; "a"; "b"]))
let test_baa4 = parse_peg baa_rules start1 ["a";"a";"b"] = None
(* A simple test grammar to make sure that Star doesn't loop when its
subexpression matches nothing. *)
let starempty_rules = fun () -> Seq [T"("; Star Empty; T")"]
let test_starempty1 =
(parse_peg starempty_rules start1 ["("; ")"; "x"]
= Some (Node ((), Sequence [Leaf "("; Repeat []; Leaf ")"]), ["x"]))
let test_starempty1 =
parse_peg starempty_rules start1 ["("; "x"] = None
(* A subset of the standard POSIX shell grammar, translated into PEG
format. The complete original grammar is in
<http://www.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_10_02>.
*)
type shell_nonterminal =
| Complete_command
| Cmd_name
| Filename
| Function_start
| IO_redirect
| Linebreak
| Newline_list
| Pipe_sequence
| Simple_command
| Separator
| Separator_op
let simple_shell_rules =
function
| Complete_command ->
Seq [N Pipe_sequence; Choose [N Separator; Empty]]
| Pipe_sequence ->
Seq [N Simple_command; Star (Seq [T"|"; N Simple_command])]
| Simple_command ->
Seq [N Cmd_name; Star (Choose [T"WORD"; N IO_redirect])]
| Cmd_name -> Seq [Not (N Function_start); T"WORD"]
| IO_redirect -> Seq [Choose [T"<"; T">"; T">>"; T"<>"; T">|"];
N Filename]
| Filename -> T"WORD"
| Newline_list -> Seq [T"\n"; N Linebreak]
| Linebreak -> Star (T"\n")
| Separator -> Choose [Seq [N Separator_op; N Linebreak]; N Newline_list]
| Separator_op -> Choose [T"&"; T";"]
| Function_start -> Seq [T"WORD"; T"("]
let parse_with_simple_shell_grammar = parse_peg simple_shell_rules
let parse_complete_command = parse_with_simple_shell_grammar (N Complete_command)
let test_cc1 = parse_complete_command [">"; "WORD"] = None
let test_cc2 =
(parse_complete_command ["WORD"; ">"]
= Some
(Node (Complete_command,
Sequence
[Node (Pipe_sequence,
Sequence
[Node (Simple_command,
Sequence
[Node (Cmd_name, Sequence [Is_not (N Function_start);
Leaf "WORD"]);
Repeat []]);
Repeat []]);
Is_empty]),
[">"]))
let test_cc3 =
(parse_complete_command
["WORD"; "WORD"; "WORD"; "<"; "WORD"; "<>"; "WORD"; "\n"; "\n"; "\n"]
= Some
(Node (Complete_command,
Sequence
[Node (Pipe_sequence,
Sequence
[Node (Simple_command,
Sequence
[Node (Cmd_name, Sequence [Is_not (N Function_start);
Leaf "WORD"]);
Repeat
[Leaf "WORD"; Leaf "WORD";
Node (IO_redirect,
Sequence [Leaf "<"; Node (Filename, Leaf "WORD")]);
Node (IO_redirect,
Sequence [Leaf "<>"; Node (Filename, Leaf "WORD")])]]);
Repeat []]);
Node (Separator,
Node (Newline_list,
Sequence [Leaf "\n"; Node (Linebreak,
Repeat [Leaf "\n"; Leaf "\n"])]))]),
[]))
let test_cc4 =
(parse_complete_command
["WORD"; "<"; "WORD"; "|"; "WORD"; "WORD"; "|";
"WORD"; ">"; "WORD"; "BLEAGH"]
= Some
(Node (Complete_command,
Sequence
[Node (Pipe_sequence,
Sequence
[Node (Simple_command,
Sequence
[Node (Cmd_name, Sequence [Is_not (N Function_start);
Leaf "WORD"]);
Repeat
[Node (IO_redirect,
Sequence [Leaf "<"; Node (Filename, Leaf "WORD")])]]);
Repeat
[Sequence
[Leaf "|";
Node (Simple_command,
Sequence
[Node (Cmd_name,
Sequence [Is_not (N Function_start); Leaf "WORD"]);
Repeat [Leaf "WORD"]])];
Sequence
[Leaf "|";
Node (Simple_command,
Sequence
[Node (Cmd_name,
Sequence [Is_not (N Function_start); Leaf "WORD"]);
Repeat
[Node (IO_redirect,
Sequence [Leaf ">";
Node (Filename, Leaf "WORD")])]])]]]);
Is_empty]),
["BLEAGH"]))
If you put the sample test cases into a file
peg-sample.ml
, you should be able to use it as follows to
test your peg.ml
solution on the SEASnet implementation
of OCaml. Similarly, the command #use "peg-test.ml";;
should run your own test cases on your solution.
$ ocaml
Objective Caml version 3.12.0
# #use "peg.ml";;
type ('a, 'b) pe = ...
type ('a, 'b) pe_tree = ...
val parse_peg : ... = <fun>
...
# #use "peg-sample.ml";;
val start1 : (unit, 'a) pe = N ()
val baa_rules : unit -> ('a, string) pe = <fun>
val test_baa1 : bool = true
val test_baa2 : bool = true
val test_baa2 : bool = true
val test_baa4 : bool = true
val starempty_rules : unit -> ('a, string) pe = <fun>
val test_starempty1 : bool = true
val test_starempty1 : bool = true
type shell_nonterminal =
Complete_command
| Cmd_name
| Filename
| Function_start
| IO_redirect
| Linebreak
| Newline_list
| Pipe_sequence
| Simple_command
| Separator
| Separator_op
val simple_shell_rules : shell_nonterminal -> (shell_nonterminal, string) pe =
<fun>
val parse_with_simple_shell_grammar :
(shell_nonterminal, string) pe ->
string list -> ((shell_nonterminal, string) pe_tree * string list) option =
<fun>
val parse_complete_command :
string list -> ((shell_nonterminal, string) pe_tree * string list) option =
<fun>
val test_cc1 : bool = true
val test_cc2 : bool = true
val test_cc3 : bool = true
val test_cc4 : bool = true
#
Ford B. Parsing expression grammars: a recognition-based syntactic foundation [PDF]. Proc 31st ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL), 2004, 111–22.