This webpage was adapted from Alex Thornton’s offering of CS 141

CompSci 142 / CSE 142 Winter 2018
Project #4: Abstract Syntax Tree

Due date and time: Feb 16, Friday, 11:59pm; No late submission will be accepted.

Introduction

This project implements the intermediate representation that we use to model crux programs. Now that the parser can recognize syntax errors and detect symbol definition and usage errors, we can proceed with building an intermediate representation of crux programs. Between the front-end (parser) and back-end (code generator), we'll represent crux programs as an Abstract Syntax Tree (AST). Once crux source code has been transformed into an AST data structure we can further analyze the crux program to detect type errors (lab 5), perform optimizations, and generate code.

Design Goals for the AST

The AST that we create must faithfully represent the crux program being compiled. Additionally, we seek to make the AST as clear and easy to use as possible. Because we will later perform traversals over the AST to check for semantic constraints, we consider all of the following issues in the design:

Concise.

We would like to strip out any unnecessary features that may be present in the crux source. For example, the AST does not need to keep the extra parentheses that may have been used in an expression.

Meaningful.

Nodes in the AST should carry some kind of semantic meaning. For example, we must track when and where variables and functions are declared or defined.

Instructive.

Nodes in the AST should represent an action (or instruction) that a computer might take. For example, we can have one node represent an if_statement. It can have 3 children: condition, thenBlock, and elseBlock.

Organized.

Nodes in the AST should be categorically distinguishable. That is, we should be able to identify the difference between statements and expressions.

An AST is not the Parse Tree

In Lab 2, we wrote a recursive descent parser. We recorded the entry and exit of each function and printed out the parse tree of crux source code. That tree records how a crux sentence (input source code) is broken down into syntactic pieces according to the rules of the crux grammar. Just as its name implies, the Abstract Syntax Tree abstracts away some of the pieces that might be present in the parse tree.

The AST avoids carrying extra syntax.

A crux sentence is allowed to carry extra information that does not necessarily change the semantics of the program. For example, according to the crux grammar parentheses can be used to nest expressions arbitrarily. Consider the following code examples, their parse trees and the corresponding AST.

Input Crux Statement:

if true {
   return 5;
}

Parse Tree:

IF_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          LITERAL
  STATEMENT_BLOCK
    STATEMENT_LIST
      STATEMENT
        RETURN_STATEMENT
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  LITERAL

Abstract Syntax Tree:

ast.IfElseBranch(1,1)
  ast.LiteralBoolean(1,4)[TRUE]
  ast.StatementList(2,4)
    ast.Return(2,4)
      ast.LiteralInteger(2,11)[5]
  ast.StatementList(4,1)

Input Crux Statement:

if (((((true))))) {
   return 5;
}

Parse Tree:

IF_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  EXPRESSION0
                    EXPRESSION1
                      EXPRESSION2
                        EXPRESSION3
                          EXPRESSION0
                            EXPRESSION1
                              EXPRESSION2
                                EXPRESSION3
                                  EXPRESSION0
                                    EXPRESSION1
                                      EXPRESSION2
                                        EXPRESSION3
                                          EXPRESSION0
                                            EXPRESSION1
                                              EXPRESSION2
                                                EXPRESSION3
                                                  LITERAL
  STATEMENT_BLOCK
    STATEMENT_LIST
      STATEMENT
        RETURN_STATEMENT
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  LITERAL

Abstract Syntax Tree:

ast.IfElseBranch(1,1)
  ast.LiteralBoolean(1,9)[TRUE]
  ast.StatementList(2,4)
    ast.Return(2,4)
      ast.LiteralInteger(2,11)[5]
  ast.StatementList(4,1)
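
One way to get this behavior is for the parser method that handles a parenthesized expression to return the node built for the inner expression, without creating any node for the parentheses themselves. The following is a minimal sketch under that assumption. The names `ParenNode`, `BoolLit`, and `ParenParser` are hypothetical and are not part of the supplied ast package, and the token list stands in for the real Scanner.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for AST nodes; the real ast package differs.
interface ParenNode {}

final class BoolLit implements ParenNode {
    final boolean value;
    BoolLit(boolean value) { this.value = value; }
}

// Parses only: expr := "(" expr ")" | "true" | "false"
final class ParenParser {
    private final List<String> tokens;
    private int pos = 0;

    ParenParser(List<String> tokens) { this.tokens = tokens; }

    ParenNode expression() {
        String tok = tokens.get(pos++);
        if (tok.equals("(")) {
            ParenNode inner = expression();   // parse the nested expression
            String close = tokens.get(pos++);
            if (!close.equals(")")) {
                throw new IllegalStateException("expected ) but found " + close);
            }
            return inner;                     // no node is created for the parentheses
        }
        if (tok.equals("true") || tok.equals("false")) {
            return new BoolLit(Boolean.parseBoolean(tok));
        }
        throw new IllegalStateException("unexpected token " + tok);
    }
}

class ParenDemo {
    public static void main(String[] args) {
        ParenNode plain = new ParenParser(Arrays.asList("true")).expression();
        ParenNode nested =
            new ParenParser(Arrays.asList("(", "(", "true", ")", ")")).expression();
        // Both inputs yield a bare literal node of the same class.
        System.out.println(plain.getClass() == nested.getClass());
    }
}
```

Because the recursion returns `inner` directly, any depth of nesting collapses to the same single literal node, which matches the two AST listings above.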

The AST has correct operator association.

In the crux grammar, the expression chain (expression0 → expression1 → expression2 → expression3) contains only right-associative rules, which generate a right-associative parse tree. In spite of the parse tree generated, the operators and, or, add, sub, mul, and div are, semantically, all left-associative. The parse tree accurately captures precedence, but incorrectly represents operator associativity. Using right association for the grammar rules aids the construction of a left-factored LL(1) grammar, which in turn aids writing a recursive descent parser. However, we must now take care to ensure that the AST captures the left-associative semantics of these operators.

Input Crux Statement:

return 3-1-1; // == 1

Parse Tree:

RETURN_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          LITERAL
      OP1
      EXPRESSION2
        EXPRESSION3
          LITERAL
      OP1
      EXPRESSION2
        EXPRESSION3
          LITERAL

Abstract Syntax Tree:

ast.Return(1,1)
  ast.Subtraction(1,11)
    ast.Subtraction(1,9)
      ast.LiteralInteger(1,8)[3]
      ast.LiteralInteger(1,10)[1]
    ast.LiteralInteger(1,12)[1]
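
A common way to obtain a left-associative tree from a recursive descent parser is to fold operands in a loop, making everything parsed so far the left child of each new operator. The sketch below illustrates that idea for subtraction only. The classes `Expr`, `IntLit`, `Sub`, and `LeftAssocParser` are hypothetical illustrations, not the supplied ast classes, and the `eval` method exists only to check the result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical expression nodes, used only for this illustration.
interface Expr { int eval(); }

final class IntLit implements Expr {
    final int value;
    IntLit(int value) { this.value = value; }
    public int eval() { return value; }
    @Override public String toString() { return Integer.toString(value); }
}

final class Sub implements Expr {
    final Expr left, right;
    Sub(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() - right.eval(); }
    @Override public String toString() { return "(" + left + " - " + right + ")"; }
}

// Parses: expr := literal { "-" literal }
final class LeftAssocParser {
    private final List<String> tokens;
    private int pos = 0;

    LeftAssocParser(List<String> tokens) { this.tokens = tokens; }

    Expr expression() {
        Expr left = literal();
        // Each new operator takes everything parsed so far as its LEFT operand,
        // so the tree grows down and to the left.
        while (pos < tokens.size() && tokens.get(pos).equals("-")) {
            pos++;                        // consume "-"
            Expr right = literal();
            left = new Sub(left, right);
        }
        return left;
    }

    private Expr literal() {
        return new IntLit(Integer.parseInt(tokens.get(pos++)));
    }
}

class LeftAssocDemo {
    public static void main(String[] args) {
        Expr e = new LeftAssocParser(Arrays.asList("3", "-", "1", "-", "1")).expression();
        System.out.println(e + " = " + e.eval());   // prints ((3 - 1) - 1) = 1
    }
}
```

A right-associative tree for the same input would instead compute 3 - (1 - 1) = 3, which is why the shape of the AST matters here.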

Nodes in the AST

The AST sits somewhere between a parse tree and a list of instructions for a machine to follow. It contains fewer nodes than the parse tree, and organizes those nodes into semantic categories. It contains higher-level information than a list of instructions, including variable declarations and function definitions. We intend the AST to be an intermediate representation that bridges the gap between source code and machine code.

The Command Class.

As a tree data structure, the AST is composed of nodes that inherit from the abstract base class Command. (I didn't want to use the term "instruction".) Each Command instance stores the line number and character position of the source code where it begins. Concrete subclasses store more specific information, to faithfully represent commands that actually occur in crux source code. We create a command class to record the actions a computer takes during execution of a crux program. For example, crux has commands for declaring variables, looping, creating constants, evaluating arithmetic and logical expressions, indexing arrays, etc.
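
To make the description concrete, here is a rough sketch of what such a base class and one concrete subclass could look like. It is not the supplied ast.Command: the class names `CommandSketch` and `LiteralIntegerSketch`, the field names, and the accessors are all assumptions made for illustration.

```java
// Sketch only: names and members are assumptions, not the supplied ast.Command.
abstract class CommandSketch {
    private final int lineNumber;
    private final int charPosition;

    protected CommandSketch(int lineNumber, int charPosition) {
        this.lineNumber = lineNumber;
        this.charPosition = charPosition;
    }

    int lineNumber() { return lineNumber; }
    int charPosition() { return charPosition; }

    // Formats the node as ClassName(line,position).
    @Override
    public String toString() {
        return getClass().getSimpleName() + "(" + lineNumber + "," + charPosition + ")";
    }
}

// A concrete subclass adds the information specific to that command.
final class LiteralIntegerSketch extends CommandSketch {
    private final int value;

    LiteralIntegerSketch(int lineNumber, int charPosition, int value) {
        super(lineNumber, charPosition);
        this.value = value;
    }

    int value() { return value; }

    @Override
    public String toString() { return super.toString() + "[" + value + "]"; }
}

class CommandDemo {
    public static void main(String[] args) {
        // prints LiteralIntegerSketch(2,11)[5]
        System.out.println(new LiteralIntegerSketch(2, 11, 5));
    }
}
```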

Categorizing the subclasses.

For each command in the crux source code we associate a subclass of Command. Some commands can only occur in certain parts of the crux grammar. For example, FunctionDefinition can only occur as part of a DeclarationList and not inside a StatementList. In contrast, both ArrayDeclaration and VariableDeclaration can occur in either a DeclarationList or a StatementList. We use these observations to break down the commands into 3 categories, each represented by an interface: Declaration, Statement, Expression.

Command	Category (Interface)	Description
`ArrayDeclaration`	Declaration Statement	The creation of an array.
`VariableDeclaration`	Declaration Statement	The creation of a variable.
`FunctionDefinition`	Declaration	The creation of a function.

`LiteralBoolean`	Expression	An embedded boolean constant, either 'true' or 'false'.
`LiteralFloat`	Expression	An embedded floating point number.
`LiteralInteger`	Expression	An embedded integer number.
`AddressOf`	Expression	The occurrence of an identifier as part of an expression (represents the address of that symbol).
`Dereference`	Expression	Load the value at a given address.
`Addition` `Subtraction` `Multiplication` `Division`	Expression	Represents basic arithmetic of two other expressions.
`Comparison`	Expression	Represents the comparison (greater than, greater equal, equal, not equal, lesser equal, less than) of two expressions.

`LogicalAnd` `LogicalOr` `LogicalNot`	Expression	Represents a logical operation (and, or) between two expressions or negation (not) of a single expression.
`Index`	Expression	An operator for indexing into an array. Both the base and the amount to index are expressions.
`Call`	Expression Statement	A function call, including an ExpressionList of arguments.
`Assignment`	Statement	An assignment of a source expression to a destination designator.
`IfElseBranch`	Statement	Represents a conditional if-else branch. Includes the condition expression, and a StatementList for each of the then and else branches.
`WhileLoop`	Statement	Represents a while loop, including the conditional expression and a StatementList for the body.
`Return`	Statement	A way for functions to return a value (and exit early).
`Error`	Declaration Expression Statement	Represents any error which may have occurred during construction of the AST.
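
The category column can be read as "which interfaces the class implements". The sketch below shows a few rows of the table expressed that way, with empty marker interfaces and empty classes. This is an assumption about the shape of the hierarchy rather than a copy of the supplied ast package, whose interfaces may declare methods, and `StatementListSketch` is a hypothetical container.

```java
import java.util.ArrayList;
import java.util.List;

// Marker interfaces for the three categories (sketch only).
interface Declaration {}
interface Statement {}
interface Expression {}

abstract class Command {}

// A class listed under two categories implements both interfaces.
class VariableDeclaration extends Command implements Declaration, Statement {}
class FunctionDefinition extends Command implements Declaration {}
class Call extends Command implements Expression, Statement {}
class Assignment extends Command implements Statement {}
class LiteralInteger extends Command implements Expression {}

// Hypothetical container: it only admits Statements, so the Java compiler
// rejects nodes from the wrong category.
final class StatementListSketch {
    private final List<Statement> items = new ArrayList<>();

    void add(Statement s) { items.add(s); }
    int size() { return items.size(); }
}

class CategoryDemo {
    public static void main(String[] args) {
        StatementListSketch body = new StatementListSketch();
        body.add(new VariableDeclaration());   // allowed: Declaration and Statement
        body.add(new Call());                  // allowed: Expression and Statement
        // body.add(new FunctionDefinition()); // would not compile: not a Statement
        System.out.println(body.size());
    }
}
```

With this arrangement, the rule that a FunctionDefinition cannot appear inside a StatementList is checked at compile time instead of at run time.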

Creating the AST

As the parser recursively descends through the parse tree of an input crux source code, it constructs the AST incrementally. We modify the methods responsible for recursive descent traversal so that each returns a branch of the final AST. For example, because the program method parses a list of declarations, it returns an ast.DeclarationList. Likewise, each method in the expression chain returns an Expression, being careful to implement correct associativity for the operations involved. By returning AST nodes from each method, the Parser can build up the final AST as it unwinds the recursive traversal.

Viewing the AST with a Visitor

From this point forward, we will not be changing the crux language to add new operations. That means we won't be adding any new classes to the Command class hierarchy. However, we will be adding new functionality to each of the existing Command classes. For example, in Lab 5: Types, we'll implement type checking and ascribe a type to each node in the AST. Rather than change all the AST nodes to add a method, we'll use the Visitor Pattern.

In the Visitor pattern, each subclass of the Command hierarchy implements an accept(Visitor visitor) method that dispatches back to the actual visitor. Any class implementing the CommandVisitor interface can provide additional functionality not present on the Command subclasses. For example, the supplied PrettyPrinter permits us to print the entire AST, but avoids adding a toPrettyString method to each of the Command classes.
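
The double dispatch described above can be sketched in miniature as follows. This is an illustration of the pattern only: `MiniVisitor`, `MiniNode`, `Lit`, `Add`, and `MiniPrinter` are hypothetical names, and the supplied CommandVisitor and PrettyPrinter have their own signatures and output format.

```java
// Hypothetical miniature of the Visitor pattern; all names are assumptions.
interface MiniVisitor {
    void visit(Lit node);
    void visit(Add node);
}

abstract class MiniNode {
    abstract void accept(MiniVisitor visitor);
}

final class Lit extends MiniNode {
    final int value;
    Lit(int value) { this.value = value; }
    // "this" has static type Lit here, so visit(Lit) is selected.
    @Override void accept(MiniVisitor visitor) { visitor.visit(this); }
}

final class Add extends MiniNode {
    final MiniNode left, right;
    Add(MiniNode left, MiniNode right) { this.left = left; this.right = right; }
    @Override void accept(MiniVisitor visitor) { visitor.visit(this); }
}

// New behavior lives in a visitor, so the node classes stay unchanged.
final class MiniPrinter implements MiniVisitor {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    private void line(String text) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(text).append('\n');
    }

    @Override public void visit(Lit node) { line("Lit[" + node.value + "]"); }

    @Override public void visit(Add node) {
        line("Add");
        depth++;                   // children are indented one level deeper
        node.left.accept(this);
        node.right.accept(this);
        depth--;
    }

    @Override public String toString() { return out.toString(); }
}

class VisitorDemo {
    public static void main(String[] args) {
        MiniNode tree = new Add(new Add(new Lit(3), new Lit(1)), new Lit(1));
        MiniPrinter printer = new MiniPrinter();
        tree.accept(printer);
        System.out.print(printer);
    }
}
```

Adding another operation, such as a type checker, would mean writing one more class that implements the visitor interface, with no changes to `Lit` or `Add`.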

What do I need to implement?

Modify your Parser to produce an AST, and return it from the parse method. The AST must accurately reflect all the commands that can occur in a crux program. It's OK for the AST to contain the two symbol errors generated in Lab 3: Symbols. If the parser encounters a syntax error, it is unable to represent the crux source as an AST, and returns an Error node instead.

For convenience, you may get a start on this lab by using a pre-made Lab4_AbstractSyntaxTree.zip project, which contains the ast package. As before, you are both allowed and encouraged to make your program easier to read and maintain by implementing helper functions with good names.

Changes from Lab 3: Symbols

The Parser.parse method now returns a Command node representing an AST.
Modify the Compiler.main driver to print out the returned AST.
Add the ast package, which contains an implementation for each of the AST nodes.
Add helper methods to Parser: Token expectRetrieve(NonTerminal nt) and Token expectRetrieve(Token.Kind kind).
Change many (but not necessarily all) of the signatures of the Parser's recursive descent methods to return an ast node instead of void.
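
The handout does not spell out what expectRetrieve should do. One plausible reading is that it behaves like an expect that also hands back the matched token, so the caller can read its lexeme and position when building an AST node. The sketch below follows that assumption for the Token.Kind overload only. `Tok` and `ExpectSketch` are simplified, hypothetical stand-ins for the real Token and Parser classes, and throwing an exception on a mismatch is also an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the real crux Token class.
final class Tok {
    enum Kind { INTEGER, SEMICOLON, EOF }

    final Kind kind;
    final String lexeme;

    Tok(Kind kind, String lexeme) {
        this.kind = kind;
        this.lexeme = lexeme;
    }
}

// Simplified stand-in for the part of the Parser that consumes tokens.
final class ExpectSketch {
    private final List<Tok> tokens;
    private int pos = 0;

    ExpectSketch(List<Tok> tokens) { this.tokens = tokens; }

    // Assumed behavior: check the current token, advance, and return it.
    Tok expectRetrieve(Tok.Kind kind) {
        Tok current = tokens.get(pos);
        if (current.kind != kind) {
            throw new IllegalStateException(
                "expected " + kind + " but found " + current.kind);
        }
        pos++;            // advance past the matched token
        return current;   // the caller can use the lexeme to build an AST node
    }
}

class ExpectDemo {
    public static void main(String[] args) {
        ExpectSketch parser = new ExpectSketch(Arrays.asList(
            new Tok(Tok.Kind.INTEGER, "5"),
            new Tok(Tok.Kind.SEMICOLON, ";"),
            new Tok(Tok.Kind.EOF, "")));
        Tok literal = parser.expectRetrieve(Tok.Kind.INTEGER);
        System.out.println(Integer.parseInt(literal.lexeme));
        parser.expectRetrieve(Tok.Kind.SEMICOLON);
    }
}
```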

Testing

Test cases are available in this tests.zip file. The provided tests are not meant to be exhaustive. You are strongly encouraged to construct your own. (If Chrome gives you a warning that "tests.zip is not commonly downloaded and could be dangerous", it means that Google hasn't performed an automated virus scan. This warning can be safely ignored, as the tests.zip file contains only text files.)

Deliverables

A zip file, named Crux.zip, containing the following files (in the crux package):

crux.NonTerminal.java, which holds the FirstSets of all production rules in the grammar.
crux.Parser.java, which performs grammar recognition of an input text.
crux.Scanner.java, which performs incremental tokenization of an input text.
crux.Compiler.java, which houses the main() function that begins your program.
crux.Token.java, which represents a string of characters read in the input text.
crux.SymbolTable.java, which implements the symbol table.
crux.Symbol.java, which implements storage for identifiers (functions, variables, and arrays).
The ast package: A class for each Command, a Visitor interface, and a PrettyPrinter.

We will release an AutoTester soon. So please make sure that your work meets our requirements. We reserve the right to assign 0 points to any submissions which cannot be automatically unzipped and tested. Additionally, we reserve the right to assign 0 points to any submission which 'games' the automated testing by using a lookup which produces only outputs that correspond to the test cases we happen to use. Be sure to submit the version of the project that you want graded, as we won't regrade if you submit the wrong version by accident.

Enjoy!

Adapted from a similar document from CS 141 Winter 2013 by Alex Thornton,