This webpage was adapted from Alex Thornton’s offering of CS 141


CompSci 142 / CSE 142 Winter 2018
Project #4: Abstract Syntax Tree

Due date and time: Feb 16, Friday, 11:59pm; No late submission will be accepted.


Introduction

This project implements the intermediate representation that we use to model crux programs. Now that the parser can recognize syntax errors and detect symbol definition and usage errors, we can proceed with building an intermediate representation of crux programs. Between the front-end (parser) and back-end (code generator), we'll represent crux programs as an Abstract Syntax Tree (AST). Once crux source code has been transformed into an AST data structure, we can further analyze the crux program to detect type errors (Lab 5), perform optimizations, and generate code.

Design Goals for the AST

The AST that we create must faithfully represent the crux program being compiled. Additionally, we seek to make the AST as clear and easy to use as possible. Because we will later perform traversals over the AST to check for semantic constraints, we consider all of the following issues in the design:

Concise.

We would like to clean up any unnecessary features that may be present in the crux source. For example, the AST does not need to keep extra parentheses that may have been used in an expression.

Meaningful.

Nodes in the AST should carry some kind of semantic meaning. For example, we must track when and where variables and functions are declared or defined.

Instructive.

Nodes in the AST should represent an action (or instruction) that a computer might take. For example, we can have one node represent an if_statement. It can have 3 children: condition, thenBlock, and elseBlock.

Organized.

Nodes in the AST should be categorically distinguishable. That is, we should be able to identify the difference between statements and expressions.

An AST is not the Parse Tree

In Lab 2, we wrote a recursive descent parser. We recorded the entry and exit of each function and printed out the parse tree of crux source code. That tree records how a crux sentence (input source code) is broken down into syntactic pieces according to the rules of the crux grammar. Just as its name implies, the Abstract Syntax Tree abstracts away some of the pieces that might be present in the parse tree.

The AST avoids carrying extra syntax.

A crux sentence is allowed to carry extra information that does not necessarily change the semantics of the program. For example, according to the crux grammar, parentheses can be used to nest expressions arbitrarily. Consider the following code examples, their parse trees, and the corresponding ASTs.

Input Crux Statement:

if true {
   return 5;
}

Parse Tree:

IF_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          LITERAL
  STATEMENT_BLOCK
    STATEMENT_LIST
      STATEMENT
        RETURN_STATEMENT
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  LITERAL

Abstract Syntax Tree:

ast.IfElseBranch(1,1)
  ast.LiteralBoolean(1,4)[TRUE]
  ast.StatementList(2,4)
    ast.Return(2,4)
      ast.LiteralInteger(2,11)[5]
  ast.StatementList(4,1)

Input Crux Statement:

if (((((true))))) {
   return 5;
}

Parse Tree:

IF_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  EXPRESSION0
                    EXPRESSION1
                      EXPRESSION2
                        EXPRESSION3
                          EXPRESSION0
                            EXPRESSION1
                              EXPRESSION2
                                EXPRESSION3
                                  EXPRESSION0
                                    EXPRESSION1
                                      EXPRESSION2
                                        EXPRESSION3
                                          EXPRESSION0
                                            EXPRESSION1
                                              EXPRESSION2
                                                EXPRESSION3
                                                  LITERAL
  STATEMENT_BLOCK
    STATEMENT_LIST
      STATEMENT
        RETURN_STATEMENT
          EXPRESSION0
            EXPRESSION1
              EXPRESSION2
                EXPRESSION3
                  LITERAL

Abstract Syntax Tree:

ast.IfElseBranch(1,1)
  ast.LiteralBoolean(1,9)[TRUE]
  ast.StatementList(2,4)
    ast.Return(2,4)
      ast.LiteralInteger(2,11)[5]
  ast.StatementList(4,1)
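
One way to see why both inputs above produce the same AST is the following minimal sketch. It assumes simplified, hypothetical classes (BoolConst, ParenSketch) and a toy grammar; these are not the classes from the supplied ast package. The method that handles a parenthesized expression simply returns the node built for the inner expression, so no node is ever created for the parentheses themselves.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a literal node; not the supplied ast.LiteralBoolean.
final class BoolConst {
    final boolean value;
    BoolConst(boolean value) { this.value = value; }
}

// Hypothetical mini-parser for the toy rule: expr := "(" expr ")" | "true" | "false"
final class ParenSketch {
    private final List<String> tokens;
    private int pos = 0;

    ParenSketch(List<String> tokens) { this.tokens = tokens; }

    BoolConst expr() {
        String tok = tokens.get(pos++);
        if (tok.equals("(")) {
            BoolConst inner = expr(); // parse the nested expression
            pos++;                    // consume the matching ")" (no error handling in this sketch)
            return inner;             // the parentheses contribute no node of their own
        }
        return new BoolConst(tok.equals("true"));
    }

    static BoolConst parse(String... tokens) {
        return new ParenSketch(Arrays.asList(tokens)).expr();
    }
}
```

In this sketch, parsing the tokens of true and of ((true)) both return a single BoolConst node, regardless of nesting depth.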
          

The AST has correct operator association.

In the crux grammar, the expression chain (expression0 → expression1 → expression2 → expression3) contains only right-associative rules, which generate a right-associative parse tree. In spite of the parse tree generated, the operators and, or, add, sub, mul, and div are, semantically, all left-associative. The parse tree accurately captures precedence, but incorrectly represents operator associativity. Using right association for the grammar rules aids the construction of a left-factored LL(1) grammar, which in turn aids writing a recursive descent parser. However, we must now take care to ensure that the AST captures the left-associative semantics of these operators.

Input Crux Statement:

return 3-1-1; // == 1

Parse Tree:

RETURN_STATEMENT
  EXPRESSION0
    EXPRESSION1
      EXPRESSION2
        EXPRESSION3
          LITERAL
      OP1
      EXPRESSION2
        EXPRESSION3
          LITERAL
      OP1
      EXPRESSION2
        EXPRESSION3
          LITERAL

Abstract Syntax Tree:

ast.Return(1,1)
  ast.Subtraction(1,11)
    ast.Subtraction(1,9)
      ast.LiteralInteger(1,8)[3]
      ast.LiteralInteger(1,10)[1]
    ast.LiteralInteger(1,12)[1]
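
A common way to obtain a left-associative tree from a recursive descent parser is to parse the chain of operands in a loop and fold each new operand into the tree built so far. The sketch below illustrates that idea under simplifying assumptions: the classes Expr, IntConstant, Minus, and LeftAssocSketch are hypothetical, only subtraction of integer literals is handled, and tokens are plain strings rather than the project's Token objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified expression nodes; not the supplied ast classes.
abstract class Expr {
    abstract int eval();
    abstract String show();
}

final class IntConstant extends Expr {
    private final int value;
    IntConstant(int value) { this.value = value; }
    @Override int eval() { return value; }
    @Override String show() { return Integer.toString(value); }
}

final class Minus extends Expr {
    private final Expr left;
    private final Expr right;
    Minus(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override int eval() { return left.eval() - right.eval(); }
    @Override String show() { return "(" + left.show() + " - " + right.show() + ")"; }
}

final class LeftAssocSketch {
    // Parses: literal { "-" literal }, folding each new operand into the left side.
    static Expr parse(String... tokens) {
        List<String> toks = Arrays.asList(tokens);
        int pos = 0;
        Expr left = new IntConstant(Integer.parseInt(toks.get(pos++)));
        while (pos < toks.size() && toks.get(pos).equals("-")) {
            pos++; // consume "-"
            Expr right = new IntConstant(Integer.parseInt(toks.get(pos++)));
            left = new Minus(left, right); // the tree built so far becomes the left child
        }
        return left;
    }
}
```

Here LeftAssocSketch.parse("3", "-", "1", "-", "1").show() gives ((3 - 1) - 1), which evaluates to 1. A right-nested tree, (3 - (1 - 1)), would evaluate to 3 instead.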
          

Nodes in the AST

The AST sits somewhere between a parse tree and a list of instructions for a machine to follow. It contains fewer nodes than the parse tree, and organizes those nodes into semantic categories. It contains higher-level information than a list of instructions, including variable declarations and function definitions. We intend the AST to be an intermediate representation that bridges the gap between source code and machine code.

The Command Class.

As a tree data structure, the AST is composed of nodes which inherit the abstract base class, Command. (I didn't want to use the term "instruction".) Each Command instance stores the line number and character position of the source code where it begins. Concrete subclasses store more specific information, to faithfully represent commands that actually occur in crux source code. We create a command class to record the actions a computer takes during execution of a crux program. For example, crux has commands for declaring variables, looping, creating constants, evaluating arithmetic and logical expressions, indexing arrays, etc.
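
As a rough illustration of the idea, the sketch below shows a base class that stores the source position and one concrete subclass that adds its own data. The names CommandSketch and IntegerLiteralSketch are hypothetical; the Command class in the supplied ast package may be organized differently.

```java
// Hypothetical sketch of a position-carrying base class; the supplied ast.Command may differ.
abstract class CommandSketch {
    private final int lineNumber;
    private final int charPosition;

    protected CommandSketch(int lineNumber, int charPosition) {
        this.lineNumber = lineNumber;
        this.charPosition = charPosition;
    }

    int lineNumber() { return lineNumber; }
    int charPosition() { return charPosition; }

    // Shared rendering of the node name and source position, e.g. "Name(2,11)".
    @Override
    public String toString() {
        return getClass().getSimpleName() + "(" + lineNumber + "," + charPosition + ")";
    }
}

// A concrete node adds the information specific to the command it represents.
final class IntegerLiteralSketch extends CommandSketch {
    private final int value;

    IntegerLiteralSketch(int lineNumber, int charPosition, int value) {
        super(lineNumber, charPosition);
        this.value = value;
    }

    int value() { return value; }

    @Override
    public String toString() {
        return super.toString() + "[" + value + "]";
    }
}
```

With these definitions, new IntegerLiteralSketch(2, 11, 5).toString() produces IntegerLiteralSketch(2,11)[5], mirroring the name, position, and value layout of the AST listings shown earlier.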

Categorizing the subclasses.

For each command in the crux source code we associate a subclass of Command. Some commands can only occur in certain parts of the crux grammar. For example, FunctionDefinition can only occur as part of a DeclarationList and not inside a StatementList. In contrast, both ArrayDeclaration and VariableDeclaration can occur in either a DeclarationList or a StatementList. We use these observations to break down the commands into 3 categories, each represented by an interface: Declaration, Statement, Expression.

Command | Category (Interface) | Description
ArrayDeclaration | Declaration, Statement | The creation of an array.
VariableDeclaration | Declaration, Statement | The creation of a variable.
FunctionDefinition | Declaration | The creation of a function.
LiteralBoolean | Expression | An embedded boolean constant, either 'true' or 'false'.
LiteralFloat | Expression | An embedded floating point number.
LiteralInteger | Expression | An embedded integer number.
AddressOf | Expression | The occurrence of an identifier as part of an expression (represents the address of that symbol).
Dereference | Expression | Load the value at a given address.
Addition, Subtraction, Multiplication, Division | Expression | Represents basic arithmetic on two other expressions.
Comparison | Expression | Represents the comparison (greater than, greater equal, equal, not equal, lesser equal, less than) of two expressions.
LogicalAnd, LogicalOr, LogicalNot | Expression | Represents a logical operation (and, or) between two expressions or negation (not) of a single expression.
Index | Expression | An operator for indexing into an array. Both the base and the amount to index are expressions.
Call | Expression, Statement | A function call, including an ExpressionList of arguments.
Assignment | Statement | An assignment of a source expression to a destination designator.
IfElseBranch | Statement | Represents a conditional if-else branch. Includes the condition expression and a StatementList for each of the then and else branches.
WhileLoop | Statement | Represents a while loop, including the conditional expression and a StatementList for the body.
Return | Statement | A way for functions to return a value (and exit early).
Error | Declaration, Expression, Statement | Represents any error which may have occurred during construction of the AST.
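
The categories can be pictured as marker interfaces that a node class implements once for each place it may appear. The sketch below is an assumption-laden illustration: the three interfaces are redeclared locally as empty markers, and the node classes (FunctionDefinitionSketch, VariableDeclarationSketch, CallSketch) and the CategorySketch helper are hypothetical stand-ins for the real ast classes.

```java
// Empty marker interfaces, redeclared here for illustration only.
interface Declaration {}
interface Statement {}
interface Expression {}

// A function definition may only appear in a DeclarationList.
final class FunctionDefinitionSketch implements Declaration {}

// A variable declaration may appear in a DeclarationList or a StatementList.
final class VariableDeclarationSketch implements Declaration, Statement {}

// A call can be used as a statement or inside an expression.
final class CallSketch implements Expression, Statement {}

final class CategorySketch {
    // A list node can accept or reject a child by checking its category.
    static boolean allowedInStatementList(Object node) {
        return node instanceof Statement;
    }

    static boolean allowedInDeclarationList(Object node) {
        return node instanceof Declaration;
    }
}
```

Under these assumptions, a FunctionDefinitionSketch is rejected by allowedInStatementList, while a VariableDeclarationSketch is accepted by both checks.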

Creating the AST

As the parser recursively descends through the parse tree of an input crux source code, it constructs the AST incrementally. We modify the methods responsible for recursive descent traversal so that each returns a branch of the final AST. For example, because the program method parses a list of declarations, it returns an ast.DeclarationList. Likewise, each method in the expression chain returns an Expression, being careful to implement correct associativity for the operations involved. By returning AST nodes from each method, the Parser can build up the final AST as it unwinds the recursive traversal.

Viewing the AST with a Visitor

From this point forward, we will not be changing the crux language to add new operations. That means we won't be adding any new classes to the Command class hierarchy. However, we will be adding new functionality to each of the existing Command classes. For example, in Lab 5: Types, we'll implement type checking and ascribe a type to each node in the AST. Rather than change all the AST nodes to add a method, we'll use the Visitor Pattern.

In the Visitor pattern, each subclass of the Command hierarchy implements an accept(Visitor visitor) method that dispatches back to the actual visitor. Any class implementing the CommandVisitor interface can provide additional functionality not present on the Command subclasses. For example, the supplied PrettyPrinter permits us to print the entire AST, but avoids adding a toPrettyString method to each of the Command classes.
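
The double dispatch described above can be sketched as follows. This is a generic illustration of the Visitor pattern, not the supplied code: SketchVisitor, SketchNode, NumberNode, PlusNode, and IndentPrinterSketch are hypothetical names, and the real CommandVisitor interface has one visit method per Command subclass.

```java
// Hypothetical visitor interface with one visit method per node type.
interface SketchVisitor {
    void visit(NumberNode node);
    void visit(PlusNode node);
}

interface SketchNode {
    void accept(SketchVisitor visitor);
}

final class NumberNode implements SketchNode {
    final int value;
    NumberNode(int value) { this.value = value; }
    // Dispatches back to the visitor overload for this concrete type.
    @Override public void accept(SketchVisitor visitor) { visitor.visit(this); }
}

final class PlusNode implements SketchNode {
    final SketchNode left;
    final SketchNode right;
    PlusNode(SketchNode left, SketchNode right) { this.left = left; this.right = right; }
    @Override public void accept(SketchVisitor visitor) { visitor.visit(this); }
}

// New behavior (indented printing) lives in the visitor, not in the node classes.
final class IndentPrinterSketch implements SketchVisitor {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    private void line(String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  "); // two spaces per nesting level
        }
        out.append(text).append('\n');
    }

    @Override public void visit(NumberNode node) {
        line("NumberNode[" + node.value + "]");
    }

    @Override public void visit(PlusNode node) {
        line("PlusNode");
        depth++;
        node.left.accept(this);  // children are visited one level deeper
        node.right.accept(this);
        depth--;
    }

    String result() { return out.toString(); }
}
```

Adding another operation, such as a type checker, would mean writing another SketchVisitor implementation while leaving NumberNode and PlusNode untouched.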

What do I need to implement?

Modify your Parser to produce an AST, and return it from the parse method. The AST must accurately reflect all the commands that can occur in a crux program. It's OK for the AST to contain the two symbol errors generated in Lab 3: Symbols. If the parser encounters a syntax error, it is unable to represent the crux source as an AST, and returns an Error node instead.

For convenience, you may get a start on this lab by using a pre-made Lab4_AbstractSyntaxTree.zip project, which contains the ast package. As before, you are both allowed and encouraged to make your program easier to read and maintain by implementing helper functions with good names.

Changes from Lab 3: Symbols

  • The Parser.parse method now returns a Command node representing an AST.
  • Modify the Compiler.main driver to print out the returned AST.
  • Add the ast package, which contains an implementation for each of the AST nodes.
  • Add helper methods to Parser: Token expectRetrieve(NonTerminal nt) and Token expectRetrieve(Token.Kind kind).
  • Change many (but not necessarily all) of the signatures of the Parser's recursive descent methods to return an ast node instead of void.
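
The assignment does not spell out how expectRetrieve behaves, so the following is only a guess at its intent: an expect that also hands back the consumed token, so the caller can build an AST node from its lexeme and position. SketchToken and TokenCursorSketch are hypothetical, and the real Parser may report errors differently (for example, through its own error-reporting routine rather than an exception).

```java
import java.util.List;

// Hypothetical token type; the project's crux.Token has its own Kind enum and fields.
final class SketchToken {
    enum Kind { IDENTIFIER, INTEGER, SEMICOLON, EOF }

    final Kind kind;
    final String lexeme;

    SketchToken(Kind kind, String lexeme) {
        this.kind = kind;
        this.lexeme = lexeme;
    }
}

final class TokenCursorSketch {
    private final List<SketchToken> tokens; // assumed non-empty and ending with EOF
    private int pos = 0;

    TokenCursorSketch(List<SketchToken> tokens) { this.tokens = tokens; }

    SketchToken current() { return tokens.get(pos); }

    // Checks the current token's kind, advances past it, and returns it.
    SketchToken expectRetrieve(SketchToken.Kind kind) {
        SketchToken tok = current();
        if (tok.kind != kind) {
            // Assumption: failure is signalled with an exception in this sketch.
            throw new IllegalStateException("expected " + kind + " but found " + tok.kind);
        }
        if (pos < tokens.size() - 1) {
            pos++; // never advance past the final (EOF) token
        }
        return tok;
    }
}
```

A recursive descent method could then write something like SketchToken tok = cursor.expectRetrieve(SketchToken.Kind.INTEGER) and construct a literal node from tok.lexeme.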

Testing

Test cases are available in this tests.zip file. The provided tests are not meant to be exhaustive. You are strongly encouraged to construct your own. (If Chrome gives you a warning that "tests.zip is not commonly downloaded and could be dangerous", it means that Google hasn't performed an automated virus scan. This warning can be safely ignored, as the tests.zip file contains only text files.)

Deliverables

A zip file, named Crux.zip, containing the following files (in the crux package):

  • crux.NonTerminal.java, which holds the FirstSets of all production rules in the grammar.
  • crux.Parser.java, which performs grammar recognition of an input text.
  • crux.Scanner.java, which performs incremental tokenization of an input text.
  • crux.Compiler.java, which houses the main() function that begins your program.
  • crux.Token.java, which represents a string of characters read in the input text.
  • crux.SymbolTable.java, which implements the symbol table.
  • crux.Symbol.java, which implements storage for identifiers (functions, variables, and arrays).
  • The ast package: A class for each Command, a Visitor interface, and a PrettyPrinter.

We will release an AutoTester soon. So please make sure that your work meets our requirements. We reserve the right to assign 0 points to any submissions which cannot be automatically unzipped and tested. Additionally, we reserve the right to assign 0 points to any submission which 'games' the automated testing by using a lookup which produces only outputs that correspond to the test cases we happen to use. Be sure to submit the version of the project that you want graded, as we won't regrade if you submit the wrong version by accident.

Enjoy!


Adapted from a similar document from CS 141 Winter 2013 by Alex Thornton.