CS132 Compiler Construction
Spring 2003
|
Programming Language SRC3Description |
During lexical analysis, characters in SRC3 source text are reduced to a series of tokens. The SRC3 compiler recognizes five kinds of tokens: keywords, identifiers, literals, operators , and separators. Comments and white spaces such as blanks (spaces), tabs, and line feeds are not tokens and will be discarded.
Comments are the same as in C-, except nesting is allowed. You may need to deal with nested comments by hand.
The following are reserved for use as keywords, and all letters must be in lowercase.
|
|
|||
|
var |
|||
|
int |
real |
string |
bool |
|
array |
of |
|
|
|
begin |
end |
|
|
|
read |
write |
writeln |
|
|
if |
then |
else |
|
|
while |
do |
|
|
|
return |
|
|
|
Identifiers must start with a letter, subsequent characters can also contain digits. Letters are the characters 'A' through 'Z', 'a' through 'z' . Case of letters is relevant.
Literals are the basic representation of any integer, real number, or string value, which often imply unsigned. In other words, your scanner must recognize -1 as a minus sign ( '-') and an integer literal 1, so as -2e-3 as a '-' and a real literal 2e-3.
An integer literal consists of a sequence of digits. Possible leading zero's are allowed.
Real number literals are similar to C's doubles. For instance, 1.2, 3.0, 4e-5 , and .50E+06 are all valid real number literals. But no real can end in '.' so that e.g. '2.' or '.' alone are not valid reals. Similarly, '.e6' or '2.e6' are not valid reals.
A string literal is zero or more characters enclosed in
double quotes. All characters (including double quotes) need to be on the same
line. Only ASCII characters are used in SRC3 , no control characters.
A double quote appearing within a string must be written by doubling it. For
example, "a""b"
denotes the string literal a"b
.
The following characters are used as operators
|
! |
* |
/ |
% |
+ |
- |
< |
> |
= |
or separators.
|
[ |
] |
( |
) |
: |
; |
, |
The following character
combinations are used as operators
|
<> |
<= |
>= |
|| |
&& |
:= |
or separator.
|
.. |
where .. stands for the
notation of (integer) subranges. For instance, 1..20 contains the integers from
1 to 20, and -20..-1 contains the integers from -20 to -1.
SRC3
does not allow a number that ends in a dot immediately followed by a
number (or token) that begins with a dot. Therefore, for input 1..20 your
scanner must recognize it as 1, .. , and 20, not as 1.
and .20.
Four base types are used: 'int', 'real', 'string', and 'bool'.
There is no formal 'char' type.
In addition to these base types, there is an array type. The array is indicated by the keyword 'array ' followed by range information enclosed in square brackets '[' and ']' then the keyword 'of' followed by one of the base types. Multiple dimensioned arrays are supported.
The range information consists of a comma-separated list of subranges, one for each dimension.
Each subrange may be a single positive integer (equal to the number of elements contained), i.e. a shorthand; way of defining the range from zero to the specified size - 1.
Or a subrange may be a pair of integers specifying the first and last indices, separated by '..' with either or both of the integers preceded by a unary '-', implying a negative value for the corresponding index.
The integers used for range
information may be literals (with or without unary minus), or IDs that refer to
such information.
The type bool is available to serve as the result of comparisons or
logical operations. It can take only the values 'true' or 'false'. Variables of
other base types may be 'cast' to bool by assignment to a variable of
type bool. Variables of type int or real with value = 0 or
string of length zero ('empty') produce bool with value 'false';
otherwise the result is 'true'. (Casting from bool is not
supported.)
Either a single declared variable specified by its ID, (for example: 'alpha').
Or a single element of an array,
specified by:
the array ID, '[', <index (or comma separated indices) in the
array>, ']',
(for example: Br[4] or Ar[2,3,4].
A (possibly empty) list of statements. If the statement list is not empty, each statement is terminated by a semicolon ';'
A comma separated list of Expressions.
An expression is a literal, an identifier, a variable, a subroutine call or one of the following three types of expression: Arithmetic Expression, Relational Expression, and Logical Expression.
Parentheses may be used to group to dictate evaluation precedence. Parenthesized subexpressions are evaluated as single entities, starting from the innermost.
If X
is an expression then ( X ) is also an expression
A Logical Expression produces a result of type bool, and is of the form:
A Relational Expression produces a result of type bool, and is of the form X <Relop> Y where X and Y are Expressions
Relop means: '>', '>=', '<', '<=', '=', '<>'
An Arithmetic Expression is of the form: X <op> Y
· Here, X or Y can be literal constants possibly preceded by an optional unary '+' or '-' sign.
· Or they can be other Expressions .
The <op> can be:
· an Addop: '+' or '-',
· or a Mulop: '*' , '/', or '%' (the mod operator).
Operator precedence is similar to that of C (listed in increasing precedence):
1. ,
2. :=
3. ||
4. &&
5. =, <>, <, >, <=, >=
6. +, - (binary op)
7. *, /, %
8. !, - (unary op)
9. (, ), [, ]
Sequences with only Addops are evaluated left to right since all Addops have the same precedence. Sequences with only Mulops are also evaluated left to right since all Mulops have the same precedence.
Parentheses may be used to force operations to occur in the specified order.
A statement block has no name or associated ID. It starts with the keyword 'begin ' followed by its (possibly empty) body, and is terminated by the keyword 'end '
The block body begins with a (possibly empty) declaration list defined just the same way as the declaration list for a program, specified below.
The block body ends with a <StatementList> which may be empty.
Blocks may be nested.
A small but adequate set of statements are used in SRC3:
· 'read' < Variable>, which allows reading a single variable at a time.
· 'write' < List of Expressions>, which allows writing a list of values. For example, even a function-type subroutine (hence its returned value) can be the argument of a write.
· 'if' <Expression> 'then' <Statement List> ' end' 'if'
· 'if' <Expression> 'then' <Statement List> ' else' <Statement List> 'end' 'if'
· 'while' <Expression> 'do' <Statement List> 'end' ' while'
· 'return ' <Expression>
A statement block may also be thought of as an allowable statement.
A SRC3 program consists of a heading, then the body of the program, then the keyword 'end ' followed by the ID.
No keyword such as 'program ' is needed nor allowed; the fact that you are dealing with a program is inferred from context.
The program heading consists of an ID naming the program, followed by a semicolon.
The program body contains a (possibly empty) declaration list consisting of declarations separated by semicolons, followed by a (possibly empty) Subroutine list followed by a statement block .
If the declaration list is not empty, it contains
either variable-declarations or constant-declarations, or both, separated by
semicolons, followed by a (possibly
empty) subroutine list .
Each variable-declaration consists of the key word 'var ' followed by a comma separated list of IDs, then a colon, then the variable type.
· This may be one of the base types.
· Or it may be an array of one of the base types.
Constants must also be declared and their values must be set in the declaration section. Since the value cannot be changed, it would be illogical, and is an error, to attempt to set the value by an assignment.
Each constant-declaration consists of the key word 'var' followed by a comma separated list of IDs, then a colon, then the value of the constant. The type of the constant is not stated but is inferred from the type of its value.
Constants may be set to an integer or real value, (preceded by a ' +' or a '-' or by no sign -- implying positive --. They may be set to a specified string. Or they may be set to another constant by listing its ID after the colon.
If there are any subroutines (a function is just a kind of subroutine that returns a value) they must then also be declared, one after the other. There are no commas or semicolons to separate these declarations.
A subroutine-declaration looks like a program, but includes arguments and return type (if any) in the heading.
A subroutine-declaration consists
of a heading , then the body of the subroutine, then the keyword
'end ' followed by the ID.
The subroutine heading consists of
· an ID naming it, then an open parenthesis '('
· followed by a (possibly empty list) of 'pass-by-value' argument-sets (separated by semicolons if there are more than one),
· followed by a (possibly empty list) of 'pass-by-reference' argument-sets (separated by semicolons if there are more than one),
o This list must be preceded by the key word var if it is not empty
· all followed by closing parenthesis ')'.
· The parentheses must be present even if there are no arguments.
e.g.
DoOrNot (a,b,c: real; e,f: int; var g,h: real; k,l: int): bool;
Each argument-set consists of a comma separated list of IDs, then a colon ' :', then the variable type.
Any type is allowed as an argument, including arrays.
Even if there are no arguments, the parentheses are needed.
The heading continues with a colon ' : ' then the type of the return value. A semicolon ends the heading. There ALWAYS has to be a return value
No keyword such as ' subroutine ' is needed nor allowed; the fact that you are dealing with a subroutine-declaration is inferred from context.
The body of a subroutine consists of a statement block.