Homework 2. Generating naive parsers for parsing expression grammars

Introduction

You're working as an assistant to a professor who is thinking about changing the CS 132 projects. CS 132 uses a parser generator that is based on LL parsers, a very mature technology, but the professor is thinking about switching to a parser generator based on parsing expression grammars (PEGs), which are less well-developed but have some promising properties. Among other things, a PEG is never ambiguous.

You'd like to test PEGs that are being proposed as test cases for CS 132 projects. One way is to test them on actual CS 132 projects, but solutions to these projects do not exist yet and anyway you'd like a second opinion in case the student projects are incorrect. So you decide to write a simple parser generator. Given the rules for a parsing expression grammar, and a start expression, your parser generator will generate a function that is a parser.

Like Homework 1, your parser generator should support any type of terminal and nonterminal symbols. This does not mean that your generator should use Homework 1's definition for the symbol type constructor. Instead, your code should define a type constructor ('nonterminal, 'terminal) pe for parsing expressions, and a type constructor ('nonterminal, 'terminal) pe_tree for parse trees produced by these expressions, with the properties defined in the next section.

Definitions

parsing expression

An OCaml term of type ('a, 'b) pe that has one of the following forms:

Empty: matches the empty string. This expression never fails to match.
T t: matches the terminal symbol that has the value t.
N n: matches the nonterminal symbol that has the value n.
Seq [e₁; ...; e_n]: matches the concatenation of instances of the subexpressions e₁, ..., e_n, left to right. If n = 0, this is equivalent to Empty.
Choose [e₁; ...; e_n]: matches whatever is matched by the first matching subexpression e₁, ..., e_n. If n = 0, this does not match anything (not even the empty string). This is prioritized choice: alternatives are tested in order, and the first successful match is used unconditionally, with later alternatives excluded. For example, Choose [T"a"; Seq [T"a";T"b"]] is equivalent to T"a" because the second alternative is excluded no matter what.
Star e: matches zero or more adjacent instances of e. This expression is greedy, that is, it always matches as many es as possible, and it refuses to match fewer than the maximum possible number of instances. Like Empty, this expression never fails.
Not e: matches the empty string, but only if the current input does not match e. If the current input matches e, Not e does not match anything.

parse tree

An OCaml term of type ('a, 'b) pe_tree that has one of the following forms:

Is_empty: represents an empty parse tree, which was matched with Empty.
Leaf t: represents the token with value t, which was matched with T t.
Node (n, T): represents the parse tree for the nonterminal with value n, which expanded into the parse tree T. This parse tree was matched with N n.
Sequence [T₁; ...; T_n]: represents the parse tree for a sequence with the subtrees T₁, ..., T_n, left to right. This tree was matched with Seq [e₁; ...; e_n], where each T_i was matched with e_i.
Repeat [T₁; ...; T_n]: represents the parse tree for a sequence with the subtrees T₁, ..., T_n, left to right. This tree was matched with Star e, where each T_i was matched with e.
Is_not e: represents an empty parse tree, which was matched with Not e.

There is no special syntax for the parse tree when input is matched by Choose [e₁; ...; e_n]; instead, the parse tree should simply be that of the chosen alternative.

parser

an OCaml function that accepts a list of terminal symbol values, and returns a value wrapped inside the Some constructor. When a parser fails to parse any prefix of its input, it returns None. When it successfully parses a prefix of its input, it returns

Some
(T, s)

, where T is the parse tree for the successfully-parsed prefix, and s is the unmatched suffix of the input, i.e., the part of the input string that was not parsed. If a parser parses all of its input, then s is [].

production function

an OCaml function whose argument is a nonterminal value. It returns the expression that the nonterminal expands to.

parser generator

a curried OCaml function that accepts a production function f and a starting expression e, and returns a parser for instances of e, assuming the grammar is described by f.

Assignment

Write a parser generator parse_pegproduce exp that returns a parser for the parsing expression grammar described by the production function produce and the parsing expression exp.

Also, write two good test cases for your parse_peg function. These test cases should all be in the style of the test cases given below. Your test cases should be named test1 and test2. The first test case should be based on the sample shell grammar shown below, except that your grammar should also support if-commands, including the tokens if, then, else, elif, and fi. The second test case should use a PEG of your own.

As with Homework 1, your code may use the Pervasives and List modules, but it should use no other modules. Your code should be free of side effects; for example, it should not use memoization. Simplicity is more important than efficiency, but your code should avoid using unnecessary time and space when it is easy to do so.

Assess your work by writing an after-action report that summarizes why you solved the problem the way you did, other approaches that you considered and rejected (and why you rejected them), and any weaknesses in your solution in the context of its intended application. This report should be a simple ASCII plain text file that consumes a page or so (at most 100 lines and 80 columns per line, please). See Resources for oral presentations and written reports for advice on how to write assessments; admittedly much of the advice there is overkill for the simple kind of report we're looking for here.

Unlike Homework 1, we are expecting some weaknesses here, so your assessment should talk about them. For example, we don't expect that your implementation will work with all possible grammars, but we would like to know which sort of grammars it will have trouble with.

Submit

We will test your program on the SEASnet Linux servers as before, so make sure that /usr/local/cs/bin is at the start of your path, using the same technique as in Homework 1.

Submit three files via CourseWeb. The file peg.ml should define parse_peg along with any auxiliary types and functions needed to define parse_peg. The file peg-test.ml should contain your test cases. The file peg.txt should hold your assessment. Please do not put your name, student ID, or other personally identifying information in your files.

Sample test cases

(* Many of the examples are so simple that there is only one nonterminal.
   For these examples, we use the 0-tuple () with no arguments as the
   nonterminal value.  For these grammars, the unit is the start symbol.
   For convenience, define the unit start symbol here.  *)
let start1 = N ()


(* A grammar for what sheep say.  *)

let baa_rules = fun () -> Seq [T"b"; Star (T"a")]

let test_baa1 =
  (parse_peg baa_rules start1 ["b"]
   = Some (Node ((), Sequence [Leaf "b"; Repeat []]), []))
let test_baa2 =
  (parse_peg baa_rules start1 ["b";"a";"a";"a";"a"]
   = Some (Node ((), Sequence [Leaf "b";
                               Repeat [Leaf "a"; Leaf "a"; Leaf "a"; Leaf "a"]]),
           []))
let test_baa2 =
  (parse_peg baa_rules start1 ["b";"a";"o";"b";"a";"b"]
   = Some (Node ((), Sequence [Leaf "b"; Repeat [Leaf "a"]]),
           ["o"; "b"; "a"; "b"]))
let test_baa4 = parse_peg baa_rules start1 ["a";"a";"b"] = None


(* A simple test grammar to make sure that Star doesn't loop when its
   subexpression matches nothing.  *)

let starempty_rules = fun () -> Seq [T"("; Star Empty; T")"]

let test_starempty1 =
   (parse_peg starempty_rules start1 ["("; ")"; "x"]
    = Some (Node ((), Sequence [Leaf "("; Repeat []; Leaf ")"]), ["x"]))
let test_starempty1 =
   parse_peg starempty_rules start1 ["("; "x"] = None


(* A subset of the standard POSIX shell grammar, translated into PEG
   format.  The complete original grammar is in
   <http://www.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_10_02>.
   *)

type shell_nonterminal =
  | Complete_command
  | Cmd_name
  | Filename
  | Function_start
  | IO_redirect
  | Linebreak
  | Newline_list
  | Pipe_sequence
  | Simple_command
  | Separator
  | Separator_op

let simple_shell_rules =
  function
    | Complete_command ->
	Seq [N Pipe_sequence; Choose [N Separator; Empty]]
    | Pipe_sequence ->
	Seq [N Simple_command; Star (Seq [T"|"; N Simple_command])]
    | Simple_command ->
	Seq [N Cmd_name; Star (Choose [T"WORD"; N IO_redirect])]
    | Cmd_name -> Seq [Not (N Function_start); T"WORD"]
    | IO_redirect -> Seq [Choose [T"<"; T">"; T">>"; T"<>"; T">|"];
			  N Filename]
    | Filename -> T"WORD"
    | Newline_list -> Seq [T"\n"; N Linebreak]
    | Linebreak -> Star (T"\n")
    | Separator -> Choose [Seq [N Separator_op; N Linebreak]; N Newline_list]
    | Separator_op -> Choose [T"&"; T";"]
    | Function_start -> Seq [T"WORD"; T"("]

let parse_with_simple_shell_grammar = parse_peg simple_shell_rules

let parse_complete_command = parse_with_simple_shell_grammar (N Complete_command)

let test_cc1 = parse_complete_command [">"; "WORD"] = None

let test_cc2 =
   (parse_complete_command ["WORD"; ">"]
    = Some
       (Node (Complete_command,
	 Sequence
	  [Node (Pipe_sequence,
	    Sequence
	     [Node (Simple_command,
	       Sequence
		[Node (Cmd_name, Sequence [Is_not (N Function_start);
                                           Leaf "WORD"]);
		 Repeat []]);
	      Repeat []]);
	   Is_empty]),
	[">"]))

let test_cc3 =
   (parse_complete_command
       ["WORD"; "WORD"; "WORD"; "<"; "WORD"; "<>"; "WORD"; "\n"; "\n"; "\n"]
    = Some
       (Node (Complete_command,
	 Sequence
	  [Node (Pipe_sequence,
	    Sequence
	     [Node (Simple_command,
	       Sequence
		[Node (Cmd_name, Sequence [Is_not (N Function_start);
                                           Leaf "WORD"]);
		 Repeat
		  [Leaf "WORD"; Leaf "WORD";
		   Node (IO_redirect,
		    Sequence [Leaf "<"; Node (Filename, Leaf "WORD")]);
		   Node (IO_redirect,
		    Sequence [Leaf "<>"; Node (Filename, Leaf "WORD")])]]);
	      Repeat []]);
	   Node (Separator,
	    Node (Newline_list,
	     Sequence [Leaf "\n"; Node (Linebreak,
                                        Repeat [Leaf "\n"; Leaf "\n"])]))]),
	[]))

let test_cc4 =
   (parse_complete_command
       ["WORD"; "<"; "WORD"; "|"; "WORD"; "WORD"; "|";
        "WORD"; ">"; "WORD"; "BLEAGH"]
    = Some
       (Node (Complete_command,
	 Sequence
	  [Node (Pipe_sequence,
	    Sequence
	     [Node (Simple_command,
	       Sequence
		[Node (Cmd_name, Sequence [Is_not (N Function_start);
                                           Leaf "WORD"]);
		 Repeat
		  [Node (IO_redirect,
		    Sequence [Leaf "<"; Node (Filename, Leaf "WORD")])]]);
	      Repeat
	       [Sequence
		 [Leaf "|";
		  Node (Simple_command,
		   Sequence
		    [Node (Cmd_name,
		      Sequence [Is_not (N Function_start); Leaf "WORD"]);
		     Repeat [Leaf "WORD"]])];
		Sequence
		 [Leaf "|";
		  Node (Simple_command,
		   Sequence
		    [Node (Cmd_name,
		      Sequence [Is_not (N Function_start); Leaf "WORD"]);
		     Repeat
		      [Node (IO_redirect,
			Sequence [Leaf ">";
                                 Node (Filename, Leaf "WORD")])]])]]]);
	   Is_empty]),
	["BLEAGH"]))

Sample use of test cases

If you put the sample test cases into a file peg-sample.ml, you should be able to use it as follows to test your peg.ml solution on the SEASnet implementation of OCaml. Similarly, the command #use "peg-test.ml";; should run your own test cases on your solution.

$ ocaml
        Objective Caml version 3.12.0

# #use "peg.ml";;
type ('a, 'b) pe = ...
type ('a, 'b) pe_tree = ...
val parse_peg : ... = <fun>
...
# #use "peg-sample.ml";;
val start1 : (unit, 'a) pe = N ()
val baa_rules : unit -> ('a, string) pe = <fun>
val test_baa1 : bool = true
val test_baa2 : bool = true
val test_baa2 : bool = true
val test_baa4 : bool = true
val starempty_rules : unit -> ('a, string) pe = <fun>
val test_starempty1 : bool = true
val test_starempty1 : bool = true
type shell_nonterminal =
    Complete_command
  | Cmd_name
  | Filename
  | Function_start
  | IO_redirect
  | Linebreak
  | Newline_list
  | Pipe_sequence
  | Simple_command
  | Separator
  | Separator_op
val simple_shell_rules : shell_nonterminal -> (shell_nonterminal, string) pe =
  <fun>
val parse_with_simple_shell_grammar :
  (shell_nonterminal, string) pe ->
  string list -> ((shell_nonterminal, string) pe_tree * string list) option =
  <fun>
val parse_complete_command :
  string list -> ((shell_nonterminal, string) pe_tree * string list) option =
  <fun>
val test_cc1 : bool = true
val test_cc2 : bool = true
val test_cc3 : bool = true
val test_cc4 : bool = true
#

Reference

Ford B. Parsing expression grammars: a recognition-based syntactic foundation [PDF]. Proc 31st ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL), 2004, 111–22.