Programming Excercise 1: Scanner

Due Tue April 22 23:59 (=11:59 PM)


General Description:

Construct a scanner for the SRC3 Language described on the official class page in the file. http://www.cs.ucla.edu/classes/spring03/cs132/src3.html.

(In addition to comments as described, begun by /* and ended by */ which may be multiline, and nested, we could also have allowed single line comments begun by // and ended by a carriage return (new line), with the first comment opening symbol deciding the style of the comment. Consider how that would be implemented. It is NOT part of this assignment.)

Details of the output format will be covered in the discussion sections, and posted on the class web page, as will details of the electronic submission of your program.

You should use LEX, or FLEX, for this exercise, and should base your program on the finite automata approach described in class and in the Louden textbook.

Since the later stages will call on the scanner to deliver tokens as needed by them, you may find it convenient to have a driver routine to get tokens, one at a time, and to do the rest of the processing.


Specification

Please adhere to the specification described here as much as possible since we are mostly running scripts comparing the outputs. Deviations could result in point loss unless you come and argue about it. However, that would cost extra time and efforts for both you and graders.

Input

The program should assume input from standard input (which can be redirected by you or a grader to read any specified file).

Token types

Keywords

Each keyword has a token of the same name, but in uppercase. (i.e. var has a token called VAR). There is one exception: The keyword begin has a token of KWBEGIN, to avoid name conflict with a lex macro which is named BEGIN.

Identifiers

Token type ID (function call, variable name, etc.) Name of the id should be included (e.g. ID(variable)).

Operators

Operator

Token Name

!

NOT

*

MUL

/

DIV

%

MOD

+

PLUS

-

MINUS

<

LT

>

GT

=

EQ

[

LBRACKET

]

RBRACKET

(

LPAREN

)

RPAREN

:

COLON

;

SEMI

,

COMMA

<>

NEQ

<=

LTE

>=

GTE

||

OR

&&

AND

:=

ASSIGN

..

SUBRANGE

Literals

Literal type

Token Name

integers

LITINT

real

LITREAL

string

LITSTRING

Value of the literal should be included (e.g. LITREAL(4.5)). In case of string literal, string length must be also reported. It is possible to have an empty string "", which corresponds to token LITSTRING(;0). For input "a""b", the token is LITSTRING(a"b;3).

Note: In case of """ or similar, the correct output should be a LITSTRING(;0) token then followed by an error about missing quote.

Note, there is no boolean literal.

Output

The output is standard output.

Your output should identify each token in the input file, should include the line number of occurrence. In the following format

token:\tline#

\t indicates the tab character. Example:

ID(testProgram):       1
SEMI:   1
BEGIN:  2
INTEGER(5):    10

Note, the first line is line 1.

You can use the sample programs in SRC3 that are provided as test programs, and you should include a collection of files to test the scanner, but your scanner should accept and analyze any SRC3 program.

Errors: If the scanner detects a lexical error, it should indicate the error as well as possible, the exit gracefully. The format is the following

ERROR LINE line#:\terror msg.

Example:

ERROR LINE 10: unclosed comment.

What to turn in

·         All the files needed to build the program, including Makefile, etc. The executable name should be src3.

·         README. Simple text file, no word/html documents please. It should describe

o        Your name, id, lecture and section #.

o        Any sources of information which you used/borrowed other than the book, including but not limited to who you discussed with (just in case that person might steal code from you), the web source of the code (if you downloaded the code from somewhere).

o        Any "advanced" features of the program for extra-credit purpose (we may or may not have any extra-credits for the project, but just in case, put them in).

·         Some simple test input files.

·         All of the above should be put in a tar file. gzip compressed or not does not matter. Only the file w/ the latest time stamp is graded. Be sure to turn in the assignment on time.


UPDATE: April 17, 2003

1.      ERROR MESSAGES

You are expected to detect the following list of errors an print the following error messages:

ERROR LINE #: Unknown Symbol: <symbol_name>

Example:        ERROR LINE 5: Unknown Symbol: @

            ERROR LINE #: Unclosed Comment

                        Example:        ERROR LINE 3: Unclosed Comment         

(Assume the comment begins on line 3; the error will

actually be detected at the end of the file)

            ERROR LINE #: Unclosed String

                        Example:        ERROR LINE 4: Unclosed String

                        (Assume the string begins on line 4. It is an error if

it is closed on any line x, x > 4 or remains unclosed

at the end of the file)

2.      CLARIFICATIONS REGARDING ”””

The basic rule of thumb for evaluating this is as follows:

“Match the longest string you can first!”

For example, consider “””

At first, we see Q1. We acknowledge the beginning

of the string. When we see Q2, we acknowledge the

possibility that Q1-Q2 is an empty string, but we haven’t

considered the longest possible match. We continue to Q3.

At this point, we see that Q2-Q3 could print the “ character

in a string started by Q1. We continue looking for the end

of the string…

 

Instead of finding the end of the string, we hit an <end-of-line>

character—an unclosed string and the end of the line. We do

not, however, report the error at this point because all that we

know is that the LONGEST MATCH failed. Before determining

whether there is an error, we must try to match shorter strings—

so we match Q1-Q2 and print LITRSTR().

 

Having matched Q1 and Q2, we continue with Q3, which begins a

string. The next character we match is an  <end-of-line>. There

are no possible shorter matches, so THIS is the error that is reported.