1 Definition of Grammars
3 Parse Trees
5 Associativity of Operators
6 Precedence of Operators
In this section, we introduce a notation - the
"context-free grammar," or "grammar" for short - that is used to specify the syntax of a language. Gram-mars will be
used throughout this book to organize compiler front ends.
A grammar naturally describes the hierarchical structure of most
program-ming language constructs. For example, an if-else statement in Java can
have the form
if ( expression ) statement else statement
That is, an if-else statement is the concatenation of the keyword if, an
open-ing parenthesis, an expression, a closing parenthesis, a statement, the
keyword else, and another statement. Using the variable expr to denote an
expres-sion and the variable stmt to denote a statement, this structuring rule
can be expressed as
stmt -> + if
) stmt else stmt
in which the arrow may be read as "can have the form." Such a
rule is called a production. In a production, lexical elements like the keyword
if and the paren-theses are called terminals. Variables like expr and stmt
represent sequences of terminals and are called nonterminals.
1 Definition of Grammars
A context-free grammar has four components:
1.A set of
terminal symbols, sometimes referred to as "tokens." The termi-nals
are the elementary symbols of the language defined by the grammar.
A set of
nonterminals, sometimes called "syntactic variables." Each
non-terminal represents a set of strings of terminals, in a manner we shall
3. A set of productions, where each production consists of a
nonterminal, called the head or left side of the production, an arrow, and a
sequence of terminals and1or
nonterminals, called the body or right side of the produc-tion. The
intuitive intent of a production is to specify one of the written forms of a
construct; if the head nonterminal represents a construct, then the body
represents a written form of the construct.
4. A designation of one of the
nonterminals as the start symbol.
We specify grammars by listing their productions, with the productions
for the start symbol listed first. We assume that digits, signs such as < and <=,
and boldface strings such as while are terminals. An italicized name is a nonterminal,
and any nonitalicized name or symbol may be assumed to be a terminal.' For
notational convenience, productions with the same nonterminal as the head can have
their bodies grouped, with the alternative bodies separated by the symbol 1 , which we read as "or."
Example 2.1 : Several examples in this chapter use expressions consisting of digits and plus and minus signs; e.g.,
strings such as 9-5+2, 3-1, or 7. Since a plus or minus sign
must appear between two digits, we refer to such expressions as "lists of
digits separated by plus or minus signs." The following grammar describes
the syntax of these expressions. The productions are:
The bodies of the three productions with nonterminal list as head equivalently can be grouped:
According to our conventions, the terminals of the grammar are the
The nonterminals are the italicized names list and digit, with list being the start symbol because its
productions are given first.
We say a production is for a
nonterminal if the nonterminal is the head of the production. A string of
terminals is a sequence of zero or more terminals. The string of zero
terminals, written as E , is called the e m p t y
A grammar derives strings by beginning with the start symbol and
repeatedly replacing a nonterminal by the body of a production for that
nonterminal. The terminal strings that can be derived from the start symbol
form the language defined by the
Example 2.2 : The language defined by the grammar of Example 2.1 consists of lists of digits separated by plus and minus signs. The ten
productions for the nonterminal digit
allow it to stand for any of the terminals 0 , 1 , . . . ,9 . From production (2.3),
a single digit by itself is a list. Productions (2.1) and (2.2) express the
rule that any list followed by a plus or minus sign and then another digit
makes up a new list.
Productions (2.1) to (2.4) are all we need to define the desired
language. For example, we can deduce that 9-5+2 is a list as follows.
a) 9 is a list by production
(2.3), since 9 is a digit.
b) 9-5 is a list by production
(2.2), since 9 is a list and 5 is a digit.
c) 9-5+2 is a list by production
(2.1), since 9-5 is a list and 2 is a
Example 2.3 : A somewhat different sort of list is the list of parameters in a function call. In Java, the parameters are enclosed within parentheses,
as in the call max(x, y) of function max with parameters
x and y. One nuance of such lists is that an empty list of parameters may be
found between the terminals ( and ). We may
start to develop a grammar for such sequences with the productions:
Note that the second possible body for optpamm s ("optional parameter list") is t, which stands for the empty string of symbols. That is, optparams can be replaced by the empty
string, so a call can consist of a
function name followed by the two-terminal string () . Notice that the productions for params
are analogous to those for dist in
Example 2.1, with comma in place of
the arithmetic operator + or -, and param in place of digit. We have not shown the productions
for param, since parameters are
really arbitrary expressions. Shortly, we shall discuss the appropriate
productions for the various language constructs, such as expressions,
statements, and so on.
Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar, and if it cannot be
derived from the start symbol of the grammar, then reporting syntax errors
within the string. Parsing is one of the most fundamental problems in all of
compiling; the main approaches to parsing are discussed in Chapter 4. In this
chapter, for simplicity, we begin with source programs like 9-5+2 in which each character is a
terminal; in general, a source program has multicharacter lexemes that are
grouped by the lexical analyzer into tokens, whose first components are the
terminals processed by the parser.
3 Parse Trees
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If nonterminal A has a production A -+ X Y Z ,
then a parse tree may have an interior node labeled A with three children labeled X, Y, and Z, from left to right:
Formally, given a context-free grammar, a parse tree according to the gram-mar is a tree with the following
1. The root is labeled by the start symbol.
Each leaf is labeled by a
terminal or by e.
Each interior node is labeled by
If A is the nonterminal labeling some interior node and X I , Xz, . . . ,Xn are the labels of the children of that node from left to right, then
there must be a production A -> X1X2 . . Xn . Here, X1, X2,. . . ,X, each stand for a symbol that
is either a terminal or a nonterminal. As a special case, if A -> c is a production, then a node
labeled A may have a single child labeled E .
Example 2.4: The derivation of 9-5+2 in Example 2.2 is illustrated by
the tree in Fig. 2.5. Each node in
the tree is labeled by a grammar symbol. An interior node and its children
correspond to a production; the interior node corresponds to the head of the
production, the children to the body.
In Fig. 2.5, the root is labeled
list, the start symbol of the grammar in Example 2.1. The children of the root
are labeled, from left to right, list, +, and
digit. Note that
list -> list + digit
is a production in the grammar of Example 2.1. The left child of the
root is similar to the root, with a child labeled - instead of +. The three nodes labeled digit
each have one child that is labeled by a digit.
From left to right, the leaves of a parse tree form the yield of the
tree, which is the string generated or derived from the nonterminal at the root
of the parse
tree. In Fig. 2.5, the yield is 9-5+2; for convenience, all the leaves
are shown at the bottom level. Henceforth, we shall not necessarily line up the
leaves in this way. Any tree imparts a natural left-to-right order to its leaves,
based on the idea that if X and Y
are two children with the same parent, and X is to the left of Y, then all
descendants of X are to the left of descendants of Y.
Another definition of the language generated by a grammar is as the set
of strings that can be generated by some parse tree. The process of finding a
parse tree for a given string of terminals is called parsing that string.
We have to be careful in talking about the structure of a string according to a grammar. A grammar can
have more than one parse tree generating a given string of terminals. Such a
grammar is said to be ambiguous. To
show that a grammar is ambiguous, all we need to do is find a terminal string
that is the yield of more than one parse tree. Since a string with more than
one parse tree usually has more than one meaning, we need to design unambiguous
grammars for compiling applications, or to use ambiguous grammars with
additional rules to resolve the ambiguities.
Example 2.5 : Suppose we used a single nonterminal
string and did not dis-tinguish between digits and lists, as in Example
2.1. We could have written the grammar
Merging the notion of digit
and list into the nonterminal string makes superficial sense, because
a single digit is a special case of a
However, Fig. 2.6 shows that an expression like 9-5+2 has more than one
parse tree with this grammar. The two trees for 9-5+2 correspond to the two
ways of parenthesizing the expression: (9-5) +2 and 9- (5+2) . This second parenthesization gives the expression the unexpected value
2 rather than the customary value 6. The grammar of Example 2.1 does not permit
5. Associativity of Operators
By convention, 9+5+2 is
equivalent to (9+5)+2 and 9 - 5 - 2 is
equivalent to ( 9 - 5 ) - 2 . When an operand like 5 has
operators to its left and right, con-ventions are needed for deciding which
operator applies to that operand. We say that the operator + associates to
the left, because an operand with plus signs on both sides of it belongs to the
operator to its left. In most programming languages the four arithmetic
operators, addition, subtraction, multiplication, and division are
Some common operators
such as exponentiation are right-associative. As another example, the
assignment operator = in C and its descendants is right-associative; that is,
the expression a=b=c is treated in the same way as the expression a =(b=c)
Strings like a=b=c
with a right-associative operator are generated by the
The contrast between a
parse tree for a left-associative
operator like - and a parse tree for a right-associative operator like = is
shown by Fig. 2.7. Note that the parse tree for 9 - 5 - 2 grows down towards
the left, whereas the parse tree for a-b=c grows down towards the right.
6 Precedence of Operators
Consider the expression 9+5*2.
There are two possible interpretations of this expression: (9+5)*2 or 9+(5*2). The
associativity rules for + and * apply to occurrences of the same operator, so
they do not resolve this ambiguity. Rules defining the relative precedence of
operators are needed when more than one kind of operator is present.
We say that * has higher
precedence than + if * takes its operands before + does. In ordinary
arithmetic, multiplication and division have higher precedence than addition
and subtraction. Therefore, 5 is
taken by * in both 9+5*2 and 9*5+2; i.e., the expressions are equivalent
to 9+(5*2) and (9*5)+2, respectively.
Example 2.6 : A grammar for arithmetic expressions can be constructed from a table showing the associativity and
precedence of operators. We start with the four common arithmetic operators and
a precedence table, showing the operators in order of increasing precedence.
Operators on the same line have the same associativity and precedence:
left-associative: + -
left-associative: * /
We create two nonterminals expr
and term for the two levels of
precedence, and an extra nonterminal factor
for generating basic units in expressions. The basic units in expressions are
presently digits and parenthesized expressions.
factor —> digit | ( expr)
Now consider the binary operators, * and /, that have the highest
prece-dence. Since these operators associate to the left, the productions are
similar to those for lists that associate to the left.
Generalizing the Expression
Grammar of Example 2.6
We can think of a factor as an expression that cannot be "torn
apart" by any operator. By "torn apart," we mean that placing an
operator next to any factor, on either side, does not cause any piece of the
factor, other than the whole, to become an operand of that operator. If the
factor is a parenthesized expression, the parentheses protect against such
"tearing," while if the factor is a single operand, it cannot be torn
A term (that is not also a factor) is an expression that can be torn
apart by operators of the highest precedence: * and /, but not by the
lower-precedence operators. An expression (that is not a term or factor) can be
torn apart by any operator.
We can generalize this idea to any number n of precedence levels. We need n+1
nonterminals. The first, like factor
in Example 2.6, can never be torn apart. Typically, the production bodies for
this nonterminal are only single operands and parenthesized expressions. Then,
for each precedence level, there is one nonterminal representing expressions
that can be torn apart only by operators at that level or higher. Typically,
the productions for this nonterminal have bodies representing uses of the
operators at that level, plus one body that is just the nonterminal for the
next higher level.
With this grammar, an expression is a list of terms separated by either
+ or - signs, and a term is a list of factors separated by * or / signs. Notice
that any parenthesized expression is a factor, so with parentheses we can
develop expressions that have arbitrarily deep nesting (and arbitrarily deep
Example 2 . 7: Keywords allow us to recognize statements, since most state-ment begin
with a keyword or a special character. Exceptions to this rule include
assignments and procedure calls. The statements defined by the (am-biguous)
grammar in Fig. 2.8 are legal in Java.
In the first production for stmt,
the terminal id represents any
identifier. The productions for expression
are not shown. The assignment statements specified by the first production are
legal in Java, although Java treats = as an assignment operator that can appear
within an expression. For example, Java allows a = b = c, which this
grammar does not.
The nonterminal stmts
generates a possibly empty list of statements. The second production for stmts generates the empty list e. The
first production generates a possibly empty list of statements followed by a
The placement of semicolons is subtle; they appear at the end of every
body that does not end in stmt. This
approach prevents the build-up of semicolons after statements such as if- and
while-, which end with nested substatements. When the nested substatement is an
assignment or a do-while, a semicolon will be generated as part of the
Exercise 2 . 2 . 3: Which of
the grammars in Exercise 2.2.2 are ambiguous?
Exercise 2 . 2 . 4: Construct unambiguous context-free grammars for each of the following languages. In each case
show that your grammar is correct.
Arithmetic expressions in postfix
Left-associative lists of
identifiers separated by commas.
Right-associative lists of
identifiers separated by commas.
Arithmetic expressions of integers and identifiers with the four binary
operators +, -, *, /.
! e) Add unary plus and minus to the arithmetic operators of (d).
Exercise 2 . 2 . 5:
a) Show that all binary strings generated by the following grammar have
values divisible by 3. Hint. Use
induction on the number of nodes in a parse tree.
num ->• 11 | 1001 | numO | num num
b) Does the grammar generate all
binary strings with values divisible by 3?
Exercise 2 . 2 . 6 : Construct a context-free grammar for roman numerals.