This webpage was adapted from Alex Thornton’s offering of CS 141


CompSci 142 / CSE 142 Winter 2018
Project #1: Scanner

Due date and time: 11:59pm, Thursday, January 18, 2018.  Late homework will not be accepted.


Introduction

In this class we follow the traditional compiler design approach, starting with the front-end and proceeding toward the back-end. As a whole, the compiler will make many incremental steps, progressively transforming source input to executable output. Operating directly on the characters of a given input source code is cumbersome, so we first seek to raise our level of abstraction.

The first step of translation from source to executable program is to perform lexical analysis. In this lab, we shall write a crux.Scanner that transforms an input text into a stream of crux.Tokens.

What's a crux.Token?

Tokens allow us to break up the input source text into more meaningful chunks. Each token encapsulates a short string of characters that carries some meaning within the Crux language. For example, the NOT_EQUAL token represents the != operator and the VAR token represents the var keyword. Each crux.Token therefore carries information about its Kind.

In addition to representing operators and keywords, we also associate literal values, such as numbers and names, with a token. For example, the NUMBER token represents the character sequence 1.8 and the IDENTIFIER token represents the variable name foo. Later stages of the compiler will need to query these tokens for their lexical string values. For example, according to the Crux grammar, when the crux.Parser (not implemented until Lab 2) comes across a function call it actually sees an IDENTIFIER token. In order to find out which function is being called, the crux.Parser asks the IDENTIFIER token for a string value representing the function's name. To support this convenient behavior, each crux.Token also carries the lexeme that it represents. I found it convenient to implement factory functions for these kinds of tokens.

Finally, in order to support error reporting later, each crux.Token also carries with it the line number and character position of where it occurred within the input source file.

Class Descriptions

crux.Token : Holds token information.
- kind    : The crux.Token.Kind
- lineNum : The line number where this token occurs
- charPos : The character position at which this token starts
- lexeme  : The token's lexical contents
 
crux.Token.Kind : An enum of all possible token kinds.
    

Design Hints

Note that Java Enums are quite powerful. You can define your own constructor and custom fields. We can use this feature to define a mapping from names (NOT_EQUAL) to lexemes (!=). There is also a .values() method which allows for-each iteration over all declared enum instances. This method comes in very handy for separating keywords from identifiers, because we can then match a string read from the input text to a corresponding enum instance.
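For instance, the keyword-versus-identifier separation might be sketched like this. The kinds and lexemes shown here are only a sample (the real set of names and lexemes comes from the Crux specification), and matchKeyword is my own name for this helper, not a required method:

```java
// A sample of what crux.Token.Kind might look like; only a few kinds are
// shown, and the real names and lexemes come from the Crux specification.
public enum Kind {
    NOT_EQUAL("!="),
    VAR("var"),
    IDENTIFIER(""),  // lexeme depends on the input text
    EOF("");

    private final String defaultLexeme;

    // Custom constructor maps each enum name to its fixed lexeme.
    Kind(String defaultLexeme) {
        this.defaultLexeme = defaultLexeme;
    }

    // Uses values() to decide whether a scanned word is a keyword;
    // anything that matches no keyword lexeme is an identifier.
    public static Kind matchKeyword(String word) {
        for (Kind kind : values()) {
            if (!kind.defaultLexeme.isEmpty() && kind.defaultLexeme.equals(word))
                return kind;
        }
        return IDENTIFIER;
    }
}
```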

You may also find certain convenience methods worth implementing. I found that I wanted a method for asking a crux.Token what kind it was, so I implemented a method with the signature: boolean is(Kind kind). For most tokens a crux.Token(String lexeme, int lineNum, int charPos) constructor was enough. But some kinds of token carry a lexeme dependent on the input source text. For these tokens, I implemented static factory functions such as crux.Token.Identifier(String name, int lineNum, int charPos) and crux.Token.Float(String num, int lineNum, int charPos).
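Those conveniences might be sketched as follows. This is a minimal stand-in, not the required crux.Token: only a few kinds are listed, and the accessor names are my own choices:

```java
// A minimal stand-in for crux.Token showing the is() method and the
// static factory functions mentioned above; only a few kinds are listed.
public class Token {
    public enum Kind { NOT_EQUAL, IDENTIFIER, FLOAT, EOF }

    private final Kind kind;
    private final String lexeme;
    private final int lineNum;
    private final int charPos;

    private Token(Kind kind, String lexeme, int lineNum, int charPos) {
        this.kind = kind;
        this.lexeme = lexeme;
        this.lineNum = lineNum;
        this.charPos = charPos;
    }

    // Factories for tokens whose lexeme depends on the input source text.
    public static Token Identifier(String name, int lineNum, int charPos) {
        return new Token(Kind.IDENTIFIER, name, lineNum, charPos);
    }

    public static Token Float(String num, int lineNum, int charPos) {
        return new Token(Kind.FLOAT, num, lineNum, charPos);
    }

    // Convenience query: is this token of the given kind?
    public boolean is(Kind kind) {
        return this.kind == kind;
    }

    public String lexeme() {
        return lexeme;
    }
}
```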

Scanning the Input

The crux.Scanner is responsible for reading an input file and converting it to a stream of crux.Tokens. Our scanner will read the input one character at a time, implementing a very lightweight version of the State Machine Pattern. I say lightweight because the tokens that we must recognize break down into well-separated categories: one-character, two-character, and many-character. The Crux language has very little overlap between these categories, so not much state needs to be preserved. In fact, the language permits a scanner whose entire state can be held in a single field, nextChar, which stores the value of the next character read from the file. There is no need for the crux.Scanner to perform any lookahead.

The crux.Scanner returns the stream of tokens through a single method: crux.Token next(). Just as with a Java Iterator, the scanner incrementally converts the input text into tokens, returning one crux.Token for each call to next(). Some tokens, such as an IDENTIFIER, represent many characters of the input text. However, the scanner reads only as many characters as required to construct the token being returned. Once the scanner has reached the end-of-file, successive calls to next() shall all return an EOF token.
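A minimal sketch of this next() loop, covering one case from each category (Token and Kind here are simplified stand-ins for the real crux classes, and the full token set comes from the Crux specification):

```java
import java.io.IOException;
import java.io.Reader;

// Simplified stand-ins for crux.Token and crux.Token.Kind.
enum Kind { NOT_EQUAL, IDENTIFIER, ERROR, EOF }
record Token(Kind kind, String lexeme, int lineNum, int charPos) {}

public class Scanner {
    private final Reader input;
    private int nextChar;  // invariant: read from input, not yet tokenized
    private int lineNum = 1;
    private int charPos = 0;

    public Scanner(Reader input) throws IOException {
        this.input = input;
        readChar();  // prime the pump
    }

    // A full version would also track newlines to keep lineNum correct.
    private void readChar() throws IOException {
        nextChar = input.read();  // returns -1 at end-of-file
        charPos++;
    }

    public Token next() throws IOException {
        while (Character.isWhitespace(nextChar))
            readChar();
        int line = lineNum, pos = charPos;
        if (nextChar == -1)  // every call after end-of-file returns EOF
            return new Token(Kind.EOF, "", line, pos);
        if (nextChar == '!') {  // a two-character token
            readChar();
            if (nextChar == '=') {
                readChar();
                return new Token(Kind.NOT_EQUAL, "!=", line, pos);
            }
            return new Token(Kind.ERROR, "!", line, pos);
        }
        if (Character.isLetter(nextChar)) {  // a many-character token
            StringBuilder lexeme = new StringBuilder();
            do {
                lexeme.append((char) nextChar);
                readChar();
            } while (Character.isLetterOrDigit(nextChar));
            return new Token(Kind.IDENTIFIER, lexeme.toString(), line, pos);
        }
        String c = String.valueOf((char) nextChar);  // unrecognized character
        readChar();
        return new Token(Kind.ERROR, c, line, pos);
    }
}
```

Note that every return path leaves nextChar holding a character that has not yet been tokenized, which is exactly the invariant described in the Design Hints below.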

Class Descriptions

crux.Scanner : Converts input source into crux.Token stream.
- lineNum : the current line number in the input.
- charPos : the current character position on the current line.
- nextChar : the next character read from the input text.
- crux.Token next(): returns the next crux.Token in the stream.
    

Design Hints

  • Although not required, you may consider having crux.Scanner implement Iterable<crux.Token> so that testing is easier.
  • The next() method should maintain the following invariant: before returning a Token the next character from the input source has been read into nextChar, but has not been inspected (turned into a Token). This 'primes the pump' for another call to next(), and means you may implement next() under the assumption that, when it begins, nextChar contains a char that has not been tokenized.
  • I personally found it convenient to implement a readChar() method that reads a single character from the input and updates the lineNum and charPos information.
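The readChar() idea from the last hint might look like this (the newline handling shown is one plausible choice, not something prescribed by the spec):

```java
import java.io.IOException;
import java.io.Reader;

// One possible readChar() helper: it pulls a single character from the
// input and keeps lineNum/charPos in sync. The class name is arbitrary;
// in practice these would be members of crux.Scanner.
class PositionTracker {
    private final Reader input;
    int nextChar;
    int lineNum = 1;
    int charPos = 0;

    PositionTracker(Reader input) {
        this.input = input;
    }

    void readChar() throws IOException {
        if (nextChar == '\n') {  // the previous character ended a line
            lineNum++;
            charPos = 0;
        }
        nextChar = input.read();  // returns -1 at end-of-file
        charPos++;
    }
}
```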

What do I need to implement?

Implement the Lexical Semantics section of the Crux Specification. Your program should be able to read input from a text file, split that file up into a stream of tokens, and print information about each token to System.out.

For convenience, you may get a start on this lab by using a pre-made Lab1_Scanner.zip project, which contains a crux.Compiler skeleton for the reading and printing parts of the project.

Testing

Test cases are available in this tests.zip file. The provided tests are not meant to be exhaustive. You are strongly encouraged to construct your own. (If Chrome gives you a warning that "tests.zip is not commonly downloaded and could be dangerous," it means that Google hasn't performed an automated virus scan. This warning can be safely ignored, as the tests.zip file contains only text files.)

Restrictions

  • The crux.Scanner must not read the entire input in one shot. Rather, it needs to read only as much as necessary to produce each token one at a time.
  • Do not use the Regular Expression library, the built-in Scanner class, or other advanced character-processing functions. The objective is to implement our own very simplistic finite state machine. Simple functions such as Character.isWhitespace() and the like are fine.
  • Even if you choose not to use the provided starting project, you must use the same token names, values, and lexemes as indicated by the Crux specification.

Deliverables

A zip file, named last_first_projectx.zip, containing the following files (in the crux package):

  • crux.Scanner.java, which performs incremental tokenization of an input text.
  • crux.Compiler.java, which houses the main() function that begins your program.
  • crux.Token.java, which represents a string of characters read in the input text.

Please submit the zip file to the appropriate project dropbox via the EEE course website. Be sure to submit the version of the project that you want graded, as we won't regrade if you submit the wrong version by accident.


Adapted from a similar document from CS 141 Winter 2013 by Alex Thornton.