CS132 Compiler Construction
Spring 2003


Programming Language SRC3Description

1. Lexical issues

During lexical analysis, characters in SRC3 source text are reduced to a series of tokens. The SRC3 compiler recognizes five kinds of tokens: keywords, identifiers, literals, operators , and separators. Comments and white spaces such as blanks (spaces), tabs, and line feeds are not tokens and will be discarded.

1.1 Comments

Comments are the same as in C-, except nesting is allowed. You may need to deal with nested comments by hand.

1.2 Keywords

The following are reserved for use as keywords, and all letters must be in lowercase.

 

var

int

real

string

bool

array

of

 

 

begin

end

 

 

read

write

writeln

 

if

then

else

 

while

do

 

 

return

 

 

 

1.3 Identifiers

Identifiers must start with a letter, subsequent characters can also contain digits. Letters are the characters 'A' through 'Z', 'a' through 'z' . Case of letters is relevant.

1.4 Literals

Literals are the basic representation of any integer, real number, or string value, which often imply unsigned. In other words, your scanner must recognize -1 as a minus sign ( '-') and an integer literal 1, so as -2e-3 as a '-' and a real literal 2e-3.

1.4.1 Integer literals

An integer literal consists of a sequence of digits. Possible leading zero's are allowed.

1.4.2 Real number literals

Real number literals are similar to C's doubles. For instance, 1.2, 3.0, 4e-5 , and .50E+06 are all valid real number literals. But no real can end in '.' so that e.g. '2.' or '.' alone are not valid reals. Similarly, '.e6' or '2.e6' are not valid reals.

1.4.3 String literals

A string literal is zero or more characters enclosed in double quotes. All characters (including double quotes) need to be on the same line. Only ASCII characters are used in SRC3 , no control characters.
A double quote appearing within a string must be written by doubling it. For example, "a""b" denotes the string literal a"b .

1.5 Operators and separators

The following characters are used as operators
 

!

*

/

%

+

-

<

>

=

or separators.
 

[

]

(

)

:

;

,

The following character combinations are used as operators
 

<>

<=

>=

||

&&

:=

or separator.
 

..

where .. stands for the notation of (integer) subranges. For instance, 1..20 contains the integers from 1 to 20, and -20..-1 contains the integers from -20 to -1.
SRC3 does not allow a number that ends in a dot immediately followed by a number (or token) that begins with a dot. Therefore, for input 1..20 your scanner must recognize it as 1, .. , and 20, not as 1. and .20.

2. Types

Four base types are used: 'int', 'real', 'string', and 'bool'.

There is no formal 'char' type.

In addition to these base types, there is an array type. The array is indicated by the keyword 'array ' followed by range information enclosed in square brackets '[' and ']' then the keyword 'of' followed by one of the base types. Multiple dimensioned arrays are supported.

The range information consists of a comma-separated list of subranges, one for each dimension.

Each subrange may be a single positive integer (equal to the number of elements contained), i.e. a shorthand; way of defining the range from zero to the specified size - 1.

Or a subrange may be a pair of integers specifying the first and last indices, separated by '..' with either or both of the integers preceded by a unary '-', implying a negative value for the corresponding index.

The integers used for range information may be literals (with or without unary minus), or IDs that refer to such information.
The type bool is available to serve as the result of comparisons or logical operations. It can take only the values 'true' or 'false'. Variables of other base types may be 'cast' to bool by assignment to a variable of type bool. Variables of type int or real with value = 0 or string of length zero ('empty') produce bool with value 'false'; otherwise the result is 'true'. (Casting from bool is not supported.)

3. Definitions

3.1 Variable

Either a single declared variable specified by its ID, (for example: 'alpha').

Or a single element of an array, specified by:
the array ID, '[', <index (or comma separated indices) in the array>, ']',
(for example: Br[4] or Ar[2,3,4].

3.2 StatementList

A (possibly empty) list of statements. If the statement list is not empty, each statement is terminated by a semicolon ';'

3.3 ExpressionList

A comma separated list of Expressions.

3.4 Expression

An expression is a literal, an identifier, a variable, a subroutine call or one of the following three types of expression: Arithmetic Expression, Relational Expression, and Logical Expression.

Parentheses may be used to group to dictate evaluation precedence. Parenthesized subexpressions are evaluated as single entities, starting from the innermost.

If X is an expression then ( X ) is also an expression

3.4.1 Logical Expression

A Logical Expression produces a result of type bool, and is of the form:

3.4.2 Relational Expression

A Relational Expression produces a result of type bool, and is of the form X <Relop> Y where X and Y are Expressions

Relop means: '>', '>=', '<', '<=', '=', '<>'

3.4.3 Arithmetic Expression

An Arithmetic Expression is of the form: X <op> Y 

·         Here, X or Y can be literal constants possibly preceded by an optional unary '+' or '-' sign.

·         Or they can be other Expressions .

The <op> can be:

·         an Addop: '+' or '-',

·         or a Mulop: '*' , '/', or '%' (the mod operator).

Operator precedence is similar to that of C (listed in increasing precedence):

1.      ,

2.      :=

3.      ||

4.      &&

5.      =, <>, <, >, <=, >=

6.      +, - (binary op)

7.      *, /, %

8.      !, - (unary op)

9.      (, ), [, ]

Sequences with only Addops are evaluated left to right since all Addops have the same precedence. Sequences with only Mulops are also evaluated left to right since all Mulops have the same precedence.

Parentheses may be used to force operations to occur in the specified order.

4. Statement Block

A statement block has no name or associated ID. It starts with the keyword 'begin ' followed by its (possibly empty) body, and is terminated by the keyword 'end '

The block body begins with a (possibly empty) declaration list defined just the same way as the declaration list for a program, specified below.

The block body ends with a <StatementList> which may be empty.

Blocks may be nested.

Allowable Statements:

A small but adequate set of statements are used in SRC3:

I/O statements

·         'read' < Variable>, which allows reading a single variable at a time.

·         'write' < List of Expressions>, which allows writing a list of values. For example, even a function-type subroutine (hence its returned value) can be the argument of a write.

Conditionals:

·         'if' <Expression> 'then' <Statement List> ' end' 'if'

·         'if' <Expression> 'then' <Statement List> ' else' <Statement List> 'end' 'if'

Flow control statements:

·         'while' <Expression> 'do' <Statement List> 'end' ' while'

·         'return ' <Expression>

A statement block may also be thought of as an allowable statement.

5. Program Structure

A SRC3 program consists of a heading, then the body of the program, then the keyword 'end ' followed by the ID.

No keyword such as 'program ' is needed nor allowed; the fact that you are dealing with a program is inferred from context.

The program heading consists of an ID naming the program, followed by a semicolon.

The program body contains a (possibly empty) declaration list consisting of declarations separated by semicolons, followed by a (possibly empty) Subroutine list followed by a statement block .

5.1 Declarations

If the declaration list is not empty, it contains either variable-declarations or constant-declarations, or both, separated by semicolons, followed by a (possibly empty)  subroutine list . 

5.1.1 Variable-declarations

Each variable-declaration consists of the key word 'var ' followed by a comma separated list of IDs, then a colon, then the variable type.

·         This may be one of the base types.

·         Or it may be an array of one of the base types.

5.1.2 Constant Declarations:

Constants must also be declared and their values must be set in the declaration section. Since the value cannot be changed, it would be illogical, and is an error, to attempt to set the value by an assignment.

Each constant-declaration consists of the key word 'var' followed by a comma separated list of IDs, then a colon, then the value of the constant. The type of the constant is not stated but is inferred from the type of its value.

Constants may be set to an integer or real value, (preceded by a ' +' or a '-' or by no sign -- implying positive --. They may be set to a specified string. Or they may be set to another constant by listing its ID after the colon.

5.1.3 Subroutine List:

If there are any subroutines (a function is just a kind of subroutine that returns a value) they must then also be declared, one after the other. There are no commas or semicolons to separate these declarations.

A subroutine-declaration looks like a program, but includes arguments and return type (if any) in the heading.

A subroutine-declaration consists of a heading , then the body of the subroutine, then the keyword 'end ' followed by the ID.
The subroutine heading consists of

·         an ID naming it, then an open parenthesis '('

·         followed by a (possibly empty list) of 'pass-by-value' argument-sets (separated by semicolons if there are more than one),

·         followed by a (possibly empty list) of 'pass-by-reference' argument-sets (separated by semicolons if there are more than one),

o        This list must be preceded by the key word var if it is not empty

·         all followed by closing parenthesis ')'.

·         The parentheses must be present even if there are no arguments.

e.g.

DoOrNot (a,b,c: real; e,f: int; var g,h: real; k,l: int): bool;

Each argument-set consists of a comma separated list of IDs, then a colon ' :', then the variable type.

Any type is allowed as an argument, including arrays.

Even if there are no arguments, the parentheses are needed.

The heading continues with a colon ' : ' then the type of the return value. A semicolon ends the heading. There ALWAYS has to be a return value

No keyword such as ' subroutine ' is needed nor allowed; the fact that you are dealing with a subroutine-declaration is inferred from context.

The body of a subroutine consists of a statement block.