Homework 4. Parsing with parsing expression grammars

Introduction

You solved Homework 2, but your professor is unhappy with the solution. It works, but it's so complicated that the professor's other assistants are having trouble coming up to speed with it, when they need to modify it. Your professor suggests that you code it up in Prolog instead, and see whether that version is easier to understand.

Definitions

parsing expression

An Prolog term that has one of the following forms:

+T: (where T is an atom, i.e., it satisfies the atom/1 predicate) matches the terminal symbol that has the value T.
N: (where N is an atom) matches the nonterminal symbol that has the value n.
seq([E₁, ..., E_n]): matches the concatenation of instances of the subexpressions E₁, ..., E_n, left to right. If n = 0, it matches only the empty string.
choose([E₁, ..., E_n]): matches whatever is matched by the first matching subexpression E₁, ..., E_n. If n = 0, this does not match anything (not even the empty string). This is prioritized choice: alternatives are tested in order, and the first successful match is used unconditionally, with later alternatives excluded. For example, choose([+a, seq([+a, +b])]) is equivalent to +a because the second alternative is excluded no matter what.
star(E): matches zero or more adjacent instances of E. This expression is greedy, that is, it always matches as many Es as possible, and it refuses to match fewer than the maximum possible number of instances. This expression never fails to match, since it can always match the empty string if all else fails.
not(E): matches the empty string, but only if the current input does not match E. If the current input matches E, not(E) does not match anything.

Unlike Homework 2, there is no separate representation for the empty PEG. You can use seq([]) instead.

parse tree

A Prolog term that has one of the following forms:

T: represents the token with value T, which was matched with +T.
node(N, T): represents the parse tree for the nonterminal with value N, which expanded into the parse tree T. This parse tree was matched with N.
sequence([T₁, ..., T_n]): represents the parse tree for a sequence with the subtrees T₁, ..., T_n, left to right. This tree was matched with sequence([E₁, ..., E_n]), where each T_i was matched with E_i.
repeat([T₁, ..., T_n]): represents the parse tree for a sequence with the subtrees T₁, ..., T_n, left to right. This tree was matched with star(E), where each T_i was matched with E.
is_not(E): represents an empty parse tree, which was matched with not(E).

There is no special syntax for the parse tree when input is matched by choose([E₁, ..., E_n]); instead, the parse tree should simply be that of the chosen alternative.

Assignment

Write a general parser peg_parse(Exp, Tree, LM, M) that takes the parsing expression Exp and matches it against a prefix of the list LM of terminal symbols. If it parses successfully, Tree is the corresponding parse tree and M is the unmatched suffix of LM. Your parser should assume that the grammar is defined by a predicate rule/2 whose first argument is a nonterminal and whose second argument is the parsing expression that the nonterminal stands for.

Also, write two good test cases for your peg_parse function. These test cases should all be in the style of the test cases given below. Your test cases should be named test1 and test2. The first test case should be based on the sample shell grammar shown below, except that your grammar should also support if-commands, including the tokens if, then, else, elif, and fi. The second test case should use a PEG of your own. You can use the same test cases that you used to answer Homework 2, translated into Prolog.

As with Homeworks 1 and 2, your code should be free of side effects; for example, it should not use I/O. Your code may use cuts (!/0), however. Simplicity is more important than efficiency, but your code should avoid using unnecessary time and space when it is easy to do so.

Assess your work by writing an after-action report that summarizes the strengths and weaknesses of your solution compared to a good solution for Homework 2. This report should be a simple ASCII plain text file that consumes a page or so (at most 100 lines and 80 columns per line, please). See Resources for oral presentations and written reports for advice on how to write assessments; admittedly much of the advice there is overkill for the simple kind of report we're looking for here.

Submit

We will test your program on the SEASnet Linux servers as before, so make sure that /usr/local/cs/bin is at the start of your path, using the same technique as in Homework 1.

Submit three files via CourseWeb. The file peg.prolog should define peg_parse along with any auxiliary functions needed to define peg_parse. The file peg-test.prolog should contain your test cases. The file peg.txt should hold your assessment. Please do not put your name, student ID, or other personally identifying information in your files.

Sample test cases

rule(complete_command,
     seq([pipe_sequence, choose([separator, empty])])).
rule(pipe_sequence,
     seq([simple_command, star(seq([+'|', simple_command]))])).
rule(simple_command,
     seq([cmd_name, star(choose([+word, io_redirect]))])).
rule(cmd_name,
     seq([not(function_start), +word])).
rule(io_redirect,
     seq([choose([+(<), +(>), +(>>), +(<>), +('>|')]), filename])).
rule(filename, +word).
rule(newline_list, seq([+'\n', linebreak])).
rule(linebreak, star(+'\n')).
rule(separator, choose([seq([separator_op, linebreak]), newline_list])).
rule(separator_op, choose([+('&'), +(;)])).
rule(function_start, seq([+word, +('(')])).
rule(empty, seq([])).


false_test(cc1, [>, word]).

true_test(cc2, [word, >], [>],
  node(complete_command,
    sequence(
      [node(pipe_sequence,
	 sequence(
	   [node(simple_command,
	      sequence(
		[node(cmd_name,
		   sequence(
		     [is_not(function_start),word])),
		 repeat([])])),
	    repeat([])])),
       node(empty,sequence([]))]))).

true_test(cc3, [word, word, word, <, word, <>, word, '\n', '\n', '\n'], [],
  node(complete_command,
    sequence(
      [node(pipe_sequence,
         sequence(
	   [node(simple_command,
	      sequence(
	        [node(cmd_name,
		   sequence(
		     [is_not(function_start),word])),
		 repeat(
		   [word,word,
		    node(io_redirect, sequence([<,node(filename,word)])),
		    node(io_redirect, sequence([<>,node(filename,word)]))])])),
            repeat([])])),
       node(separator,
	 node(newline_list,
	   sequence(['\n', node(linebreak, repeat(['\n','\n']))])))]))).

true_test(cc4, [word, <, word, '|', word, >, word, bleagh], [bleagh],
  node(complete_command,
    sequence(
      [node(pipe_sequence,
	 sequence(
	   [node(simple_command,
	      sequence(
	        [node(cmd_name,
		   sequence([is_not(function_start),word])),
                 repeat(
		   [node(io_redirect,
		     sequence([<, node(filename, word)]))])])),
            repeat(
	      [sequence(
	         ['|',
		  node(simple_command,
		    sequence(
		      [node(cmd_name,
		         sequence([is_not(function_start), word])),
                       repeat(
		         [node(io_redirect,
			  sequence([>,node(filename,word)]))])]))])])])),
       node(empty,sequence([]))]))).

test_parse(LM, M, T) :- peg_parse(complete_command, T, LM, M).

test_failed(Name, peg_parse_should_have_failed,
	    input=LM, remaining=M, tree=T) :-
  false_test(Name, LM),
  test_parse(LM, M, T).

test_failed(Name, peg_parse_should_have_succeeded,
	    input=LM, remaining=MvsM1, tree=TvsT1) :-
  true_test(Name, LM, M, T),
  test_parse(LM, M1, T1),
  (M \= M1; T \= T1),
  (M = M1 -> MvsM1 = ok ; MvsM1 = (M \= M1)),
  (T = T1 -> TvsT1 = ok ; TvsT1 = (T \= T1)).

Sample use of test cases

If you put the sample test cases into a file peg-sample.prolog, you should be able to use it as follows to test your peg.prolog solution on the SEASnet implementation of GNU Prolog. Similarly, the top-level command ['peg-test.prolog'], test_failed(Name, Type, Input, Remaining, Tree) should run your own test cases on your solution.

$ gprolog
GNU Prolog 1.3.1
By Daniel Diaz
Copyright (C) 1999-2009 Daniel Diaz
| ?- ['peg.prolog'].
compiling .../peg.prolog for byte code...
.../peg.prolog compiled, ...

yes
| ?- ['peg-sample.prolog'].
compiling .../peg-sample.prolog for byte code...
.../peg-sample.prolog compiled, ...

yes
| ?- test_failed(A,B,C,D,E).

no

Reference

Ford B. Parsing expression grammars: a recognition-based syntactic foundation [PDF]. Proc 31st ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL), 2004, 111–22.