Top-Down Parsing
1 Recursive-Descent Parsing
2 FIRST and FOLLOW
3 LL(1) Grammars
4 Nonrecursive Predictive Parsing
5 Error Recovery in Predictive Parsing
6 Exercises for Section 4.4
Top-down parsing can be viewed as the problem of constructing a parse
tree for the input string, starting from the root and creating the nodes of the
parse tree in preorder (depth-first, as discussed in Section 2.3.4).
Equivalently, top-down parsing can be viewed as finding a leftmost derivation
for an input string.
Example 4.27: The sequence of parse trees in Fig. 4.12 for the input id+id*id is a top-down parse according to grammar (4.2), repeated here as (4.28):

E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

This sequence of trees corresponds to a leftmost derivation of the input.
At each step of a top-down parse, the key problem is that of determining
the production to be applied for a nonterminal, say A. Once an A-production is chosen, the rest of the parsing process
consists of "matching" the terminal symbols in the production body
with the input string.
The section begins with a general form of top-down parsing, called recursive-descent parsing, which may require backtracking to find the correct A-production to apply. Section 2.4.2 introduced predictive parsing, a special case of recursive-descent parsing, where no backtracking is required. Predictive parsing chooses the correct A-production by looking ahead at the input a fixed number of symbols, typically just one (that is, the next input symbol).
For example, consider the top-down parse in Fig. 4.12, which constructs a tree with two nodes labeled E'. At the first E' node (in preorder), the production E' → + T E' is chosen; at the second E' node, the production E' → ε is chosen. A predictive parser can choose between E'-productions by looking at the next input symbol.
The class of grammars for which we can construct predictive parsers looking k symbols ahead in the input is sometimes called the LL(k) class. We discuss the LL(1) class in Section 4.4.3, but introduce certain computations, called FIRST and FOLLOW, in a preliminary Section 4.4.2. From the FIRST and FOLLOW sets for a grammar, we shall construct "predictive parsing tables," which make explicit the choice of production during top-down parsing. These sets are also useful during bottom-up parsing.
In Section 4.4.4 we give a nonrecursive parsing algorithm that maintains a stack explicitly, rather than implicitly via recursive calls. Finally, in Section 4.4.5 we discuss error recovery during top-down parsing.
1. Recursive-Descent Parsing
void A() {
 1)   choose an A-production, A → X1 X2 ··· Xk;
 2)   for ( i = 1 to k ) {
 3)       if ( Xi is a nonterminal )
 4)           call procedure Xi();
 5)       else if ( Xi equals the current input symbol a )
 6)           advance the input to the next symbol;
 7)       else /* an error has occurred */;
      }
}
Figure 4.13: A typical procedure for a nonterminal in a top-down parser
A recursive-descent parsing program consists of a set of procedures, one
for each nonterminal. Execution begins with the procedure for the start symbol,
which halts and announces success if its procedure body scans the entire input
string. Pseudocode for a typical nonterminal appears in Fig. 4.13. Note that
this pseudocode is nondeterministic, since it begins by choosing the
A-production to apply in a manner that is not specified.
General recursive-descent may require backtracking; that is, it may require repeated scans over the input. However, backtracking is rarely needed to parse programming language constructs, so backtracking parsers are not seen frequently. Even for situations like natural language parsing, backtracking is not very efficient, and tabular methods such as the dynamic programming algorithm of Exercise 4.4.9 or the method of Earley (see the bibliographic notes) are preferred.
To allow backtracking, the code of Fig. 4.13 needs to be modified.
First, we cannot choose a unique A-production at line (1), so we must try each
of several productions in some order. Then, failure at line (7) is not ultimate
failure, but suggests only that we need to return to line (1) and try another A-production.
Only if there are no more A-productions to try do we declare that an input
error has been found. In order to try another A-production, we need to be able
to reset the input pointer to where it was when we first reached line (1).
Thus, a local variable is needed to store this input pointer for future use.
Example 4.29: Consider the grammar

S → c A d
A → a b | a
To construct a parse tree top-down for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree of Fig. 4.14(a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.

Now, we expand A using the first alternative A → a b to obtain the tree of Fig. 4.14(b). We have a match for the second input symbol, a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried, but that might produce a match.

In going back to A, we must reset the input pointer to position 2, the position it had when we first came to A, which means that the procedure for A must store the input pointer in a local variable.

The second alternative for A produces the tree of Fig. 4.14(c). The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have produced a parse tree for w, we halt and announce successful completion of parsing.
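The procedure of Fig. 4.13, extended with the backtracking just described, can be made concrete. The following Python sketch parses this example's grammar (S → cAd, A → ab | a); the class name, method names, and the final acceptance check are illustrative choices of this sketch, not part of the text.

# Backtracking recursive-descent parser for
#   S -> c A d
#   A -> a b | a
# (an illustrative sketch; names are not from the text)
class BacktrackingParser:
    def __init__(self, text):
        self.text = text
        self.pos = 0                      # the input pointer

    def match(self, terminal):
        if self.pos < len(self.text) and self.text[self.pos] == terminal:
            self.pos += 1
            return True
        return False

    def S(self):
        # S -> c A d: the only S-production
        return self.match('c') and self.A() and self.match('d')

    def A(self):
        saved = self.pos                  # store the input pointer locally
        if self.match('a') and self.match('b'):
            return True                   # first alternative, A -> a b
        self.pos = saved                  # reset and try the next alternative
        return self.match('a')            # second alternative, A -> a

def accepts(w):
    p = BacktrackingParser(w)
    return p.S() and p.pos == len(w)      # success only if all input is used

print(accepts("cad"))    # True: A -> a b fails on d, backtrack, A -> a succeeds
print(accepts("cabd"))   # True: the first alternative for A succeeds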
A left-recursive grammar can cause a recursive-descent parser, even one
with backtracking, to go into an infinite loop. That is, when we try to expand
a nonterminal A, we may eventually
find ourselves again trying to expand A
without having consumed any input.
2. FIRST and FOLLOW
The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G. For a string of grammar symbols α, FIRST(α) is the set of terminals that begin strings derived from α; if α ⇒* ε, then ε is also in FIRST(α). For a nonterminal A, FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form; in addition, $ (the input endmarker) is in FOLLOW(A) if A can be the rightmost symbol of some sentential form. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol. During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens.
Example 4.30: Consider again the non-left-recursive grammar (4.28). Then:
1. FIRST(F) = FIRST(T) = FIRST(E) = { (, id }. To see why, note that the two productions for F have bodies that start with these two terminal symbols, id and the left parenthesis. T has only one production, and its body starts with F. Since F does not derive ε, FIRST(T) must be the same as FIRST(F). The same argument covers FIRST(E).

2. FIRST(E') = { +, ε }. The reason is that one of the two productions for E' has a body that begins with terminal +, and the other's body is ε. Whenever a nonterminal derives ε, we place ε in FIRST for that nonterminal.

3. FIRST(T') = { *, ε }. The reasoning is analogous to that for FIRST(E').

4. FOLLOW(E) = FOLLOW(E') = { ), $ }. Since E is the start symbol, FOLLOW(E) must contain $. The production body ( E ) explains why the right parenthesis is in FOLLOW(E). For E', note that this nonterminal appears only at the ends of bodies of E-productions. Thus, FOLLOW(E') must be the same as FOLLOW(E).

5. FOLLOW(T) = FOLLOW(T') = { +, ), $ }. Notice that T appears in bodies only followed by E'. Thus, everything except ε that is in FIRST(E') must be in FOLLOW(T); that explains the symbol +. However, since FIRST(E') contains ε (i.e., E' ⇒* ε), and E' is the entire string following T in the bodies of the E-productions, everything in FOLLOW(E) must also be in FOLLOW(T). That explains the symbols $ and the right parenthesis. As for T', since it appears only at the ends of the T-productions, it must be that FOLLOW(T') = FOLLOW(T).

6. FOLLOW(F) = { +, *, ), $ }. The reasoning is analogous to that for T in point (5).
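These sets can also be computed mechanically, by applying the defining rules until nothing changes. The following Python sketch does this for grammar (4.28); the grammar representation, the helper name first_of, and the use of 'ε' and '$' as markers are assumptions of the sketch, not notation from the text.

# Fixed-point computation of FIRST and FOLLOW for grammar (4.28).
EPS = 'ε'
grammar = [
    ('E',  ['T', "E'"]),
    ("E'", ['+', 'T', "E'"]), ("E'", []),        # E' -> + T E' | ε
    ('T',  ['F', "T'"]),
    ("T'", ['*', 'F', "T'"]), ("T'", []),        # T' -> * F T' | ε
    ('F',  ['(', 'E', ')']),  ('F',  ['id']),
]
nonterminals = {head for head, _ in grammar}

def first_of(seq, FIRST):
    """FIRST of a string of grammar symbols."""
    out = set()
    for X in seq:
        if X not in nonterminals:       # a terminal begins every derivation
            out.add(X)
            return out
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    out.add(EPS)                        # every symbol of seq can vanish
    return out

FIRST = {A: set() for A in nonterminals}
FOLLOW = {A: set() for A in nonterminals}
FOLLOW['E'].add('$')                    # E is the start symbol

changed = True
while changed:                          # iterate until a fixed point
    changed = False
    for A, body in grammar:
        f = first_of(body, FIRST)
        if not f <= FIRST[A]:
            FIRST[A] |= f; changed = True
        for i, X in enumerate(body):
            if X in nonterminals:
                trailer = first_of(body[i+1:], FIRST)
                add = (trailer - {EPS}) | (FOLLOW[A] if EPS in trailer else set())
                if not add <= FOLLOW[X]:
                    FOLLOW[X] |= add; changed = True

print(FIRST["E'"])   # {'+', 'ε'} (set order may vary)
print(FOLLOW['T'])   # {'+', ')', '$'}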
3. LL(1) Grammars
Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions.
Transition Diagrams for Predictive Parsers
Transition diagrams are useful for visualizing predictive parsers. For example, the transition diagrams for nonterminals E and E' of grammar (4.28) appear in Fig. 4.16(a). To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the grammar. Then, for each nonterminal A,

1. Create an initial and final (return) state.

2. For each production A → X1 X2 ··· Xk, create a path from the initial to the final state, with edges labeled X1, X2, ..., Xk. If A → ε, the path is an edge labeled ε.
Transition diagrams for predictive parsers differ from those for lexical
analyzers. Parsers have one diagram for each nonterminal. The labels of edges
can be tokens or nonterminals. A transition on a token (terminal)
means that we take that transition if that token is the next input
symbol. A transition on a nonterminal A is a call of the procedure for A.
With an LL(1) grammar, the ambiguity of whether or not to take an ε-edge can be resolved by making ε-transitions the default choice.
Transition diagrams can be simplified, provided the sequence of grammar symbols along paths is preserved. We may also substitute the diagram for a nonterminal A in place of an edge labeled A. The diagrams in Fig. 4.16(a) and (b) are equivalent: if we trace paths from E to an accepting state and substitute for E', then, in both sets of diagrams, the grammar symbols along the paths make up strings of the form T + T + ··· + T.
The diagram in (b) can be obtained from (a) by transformations akin to those in Section 2.5.4, where we used tail-recursion removal and substitution of procedure bodies to optimize the procedure for a nonterminal.
The class of LL(1) grammars is rich enough to cover most programming
constructs, although care is needed in writing a suitable grammar for the
source language. For example, no left-recursive or ambiguous grammar can be
LL(1).
A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.

2. At most one of α and β can derive the empty string.

3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).
Figure 4.16: Transition diagrams for nonterminals E and E' of grammar (4.28)
The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are disjoint sets. The third condition is equivalent to stating that if ε is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).
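For illustration, these conditions can be checked directly once FIRST and FOLLOW are known. The sketch below reuses grammar, FIRST, FOLLOW, first_of, and EPS from the earlier sketch; the function name is hypothetical.

# Pairwise LL(1) check over each nonterminal's alternatives
# (reuses definitions from the FIRST/FOLLOW sketch above).
from collections import defaultdict
from itertools import combinations

def is_ll1():
    bodies = defaultdict(list)
    for A, body in grammar:
        bodies[A].append(body)
    for A, alts in bodies.items():
        for alpha, beta in combinations(alts, 2):
            fa, fb = first_of(alpha, FIRST), first_of(beta, FIRST)
            # conditions 1 and 2: no common lookahead, at most one derives ε
            if (fa - {EPS}) & (fb - {EPS}) or (EPS in fa and EPS in fb):
                return False
            # condition 3: if one body derives ε, the other's FIRST set
            # must be disjoint from FOLLOW(A)
            if EPS in fb and (fa - {EPS}) & FOLLOW[A]:
                return False
            if EPS in fa and (fb - {EPS}) & FOLLOW[A]:
                return False
    return True

print(is_ll1())   # True for grammar (4.28)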
Predictive parsers can be constructed for LL(1) grammars since the proper production to apply for a nonterminal can be selected by looking only at the current input symbol. Flow-of-control constructs, with their distinguishing keywords, generally satisfy the LL(1) constraints. For instance, if we have the productions

stmt → if ( expr ) stmt else stmt
     | while ( expr ) stmt
     | { stmt_list }

then the keywords if, while, and the symbol { tell us which alternative is the only one that could possibly succeed if we are to find a statement.
The next algorithm collects the information from FIRST and FOLLOW sets into a predictive parsing table M[A, a], a two-dimensional array, where A is a nonterminal, and a is a terminal or the symbol $, the input endmarker. The algorithm is based on the following idea: the production A → α is chosen if the next input symbol a is in FIRST(α). The only complication occurs when α = ε or, more generally, α ⇒* ε. In this case, we should again choose A → α, if the current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in FOLLOW(A).
Algorithm 4.31: Construction of a predictive parsing table.

INPUT: Grammar G.

OUTPUT: Parsing table M.

METHOD: For each production A → α of the grammar, do the following:

1. For each terminal a in FIRST(α), add A → α to M[A, a].

2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.

If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to error (which we normally represent by an empty entry in the table).
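A direct transcription of this method, reusing the structures from the earlier sketches, might look as follows; representing the table as a dictionary from (nonterminal, terminal) pairs to lists of productions is a choice of this sketch, and makes multiply defined entries visible.

# Algorithm 4.31 as a Python sketch (reuses grammar, FIRST, FOLLOW,
# first_of, EPS from the FIRST/FOLLOW sketch). Missing entries are error.
def build_table():
    M = {}
    for A, body in grammar:
        f = first_of(body, FIRST)
        for a in f - {EPS}:
            M.setdefault((A, a), []).append((A, body))      # step 1
        if EPS in f:
            # FOLLOW(A) already contains $ where appropriate, so iterating
            # over it covers both parts of step 2
            for b in FOLLOW[A]:
                M.setdefault((A, b), []).append((A, body))
    return M

M = build_table()
# For an LL(1) grammar, every entry holds exactly one production:
assert all(len(ps) == 1 for ps in M.values())
print(M[('E', 'id')])   # [('E', ['T', "E'"])]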
Example 4.32: For the expression grammar (4.28), Algorithm 4.31 produces the parsing table in Fig. 4.17. Blanks are error entries; nonblanks indicate a production with which to expand a nonterminal.
Consider production E → T E'. Since

FIRST(T E') = FIRST(T) = { (, id }

this production is added to M[E, (] and M[E, id]. Production E' → + T E' is added to M[E', +] since FIRST(+ T E') = { + }. Since FOLLOW(E') = { ), $ }, production E' → ε is added to M[E', )] and M[E', $].
Algorithm 4.31 can be applied to any grammar G to produce a parsing table M. For every LL(1) grammar, each parsing-table entry uniquely identifies a production or signals an error. For some grammars, however, M may have some entries that are multiply defined. For example, if G is left-recursive or ambiguous, then M will have at least one multiply defined entry. Although left-recursion elimination and left factoring are easy to do, there are some grammars for which no amount of alteration will produce an LL(1) grammar. The language in the following example has no LL(1) grammar at all.
Example 4.33: The following grammar, which abstracts the dangling-else problem, is repeated here from Example 4.22:

S  → i E t S S' | a
S' → e S | ε
E  → b

The parsing table for this grammar appears in Fig. 4.18. The entry for M[S', e] contains both S' → e S and S' → ε.
The grammar is ambiguous, and the ambiguity is manifested by a choice in what production to use when an e (else) is seen. We can resolve this ambiguity by choosing S' → e S. This choice corresponds to associating an else with the closest previous then. Note that the choice S' → ε would prevent e from ever being put on the stack or removed from the input, and is surely wrong.
4. Nonrecursive Predictive Parsing
A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒* wα by a leftmost derivation.
The table-driven parser in Fig. 4.19 has an input buffer, a stack
containing a sequence of grammar symbols, a parsing table constructed by
Algorithm 4.31, and an output stream. The input buffer contains the string to
be parsed, followed by the endmarker $. We reuse the symbol $ to mark the
bottom of the stack, which initially contains the start symbol of the grammar
on top of $.
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current input symbol. If X is a nonterminal, the parser chooses
an X-production by consulting entry M[X,
a] of the parsing table M.
(Additional code could be executed here, for example, code to construct a node
in a parse tree.) Otherwise, it checks for a match between the terminal X and current input symbol a.
The behavior of the parser can be described in terms of its configurations, which give the stack
contents and the remaining input. The next algorithm describes how
configurations are manipulated.
Algorithm 4.34: Table-driven predictive parsing.

INPUT: A string w and a parsing table M for grammar G.

OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.

METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start symbol S of G on top of the stack, above $. The program in Fig. 4.20 uses the predictive parsing table M to produce a predictive parse for the input.
set ip to point to the first symbol of w;
set X to the top stack symbol;
while ( X ≠ $ ) {  /* stack is not empty */
    let a be the input symbol pointed to by ip;
    if ( X = a ) pop the stack and advance ip;
    else if ( X is a terminal ) error();
    else if ( M[X, a] is an error entry ) error();
    else if ( M[X, a] = X → Y1 Y2 ··· Yk ) {
        output the production X → Y1 Y2 ··· Yk;
        pop the stack;
        push Yk, Yk−1, ..., Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
Figure 4.20: Predictive parsing algorithm
Example 4.35: Consider grammar (4.28); we have already seen its parsing table in Fig. 4.17. On input id + id * id, the nonrecursive predictive parser of Algorithm 4.34 makes the sequence of moves in Fig. 4.21. These moves correspond to a leftmost derivation (see Fig. 4.12 for the full derivation).
Note that the sentential forms in this derivation correspond to the input that has already been matched (in column MATCHED) followed by the stack contents. The matched input is shown only to highlight the correspondence. For the same reason, the top of the stack is to the left; when we consider bottom-up parsing, it will be more natural to show the top of the stack to the right. The input pointer points to the leftmost symbol of the string in the INPUT column.
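The driver of Fig. 4.20 is short enough to transcribe almost line for line. This Python sketch reuses the table M and the set nonterminals from the earlier sketches; hard-coding E as the start symbol and raising SyntaxError at error entries are choices of the sketch, not of the algorithm.

# Table-driven predictive parser (Algorithm 4.34) as a Python sketch.
def predictive_parse(tokens):
    table = {k: v[0] for k, v in M.items()}    # (A, a) -> single production
    input_ = tokens + ['$']
    ip = 0
    stack = ['$', 'E']                         # start symbol on top of $
    while stack[-1] != '$':
        X, a = stack[-1], input_[ip]
        if X == a:                             # terminal on top matches input
            stack.pop(); ip += 1
        elif X not in nonterminals:            # terminal mismatch
            raise SyntaxError(f'expected {X}, saw {a}')
        elif (X, a) not in table:              # error entry
            raise SyntaxError(f'no production for {X} on {a}')
        else:
            A, body = table[(X, a)]
            print(A, '->', ' '.join(body) or EPS)   # output the production
            stack.pop()
            stack.extend(reversed(body))       # push Yk ... Y1, Y1 on top
    return ip == len(tokens)                   # all input consumed

predictive_parse(['id', '+', 'id', '*', 'id'])
# prints the productions of the leftmost derivation, E -> T E' first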
5. Error Recovery in Predictive Parsing
This discussion of error recovery refers to the stack of a table-driven
predictive parser, since it makes explicit the terminals and nonterminals that
the parser hopes to match with the remainder of the input; the techniques can
also be used with recursive-descent parsing.
An error is detected during predictive parsing when the terminal on top
of the stack does not match the next input symbol or when nonterminal A is on
top of the stack, a is the next input symbol, and M[A, a] is error (i.e., the
parsing-table entry is empty).
Panic Mode
Panic-mode error recovery is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that are likely to occur in practice. Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue.

2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if semicolons terminate statements, as in C, then keywords that begin statements may not appear in the FOLLOW set of the nonterminal representing expressions. A missing semicolon after an assignment may therefore result in the keyword beginning the next statement being skipped. Often, there is a hierarchical structure on constructs in a language; for example, expressions appear within statements, which appear within blocks, and so on. We can add to the synchronizing set of a lower-level construct the symbols that begin higher-level constructs. For example, we might add keywords that begin statements to the synchronizing sets for the nonterminals generating expressions.

3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.

4. If a nonterminal can generate the empty string, then the production deriving ε can be used as a default. Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces the number of nonterminals that have to be considered during error recovery.

5. If a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message saying that the terminal was inserted, and continue parsing. In effect, this approach takes the synchronizing set of a token to consist of all other tokens.
Example 4.36: Using FIRST and FOLLOW symbols as synchronizing tokens works reasonably well when expressions are parsed according to the usual grammar (4.28). The parsing table for this grammar in Fig. 4.17 is repeated in Fig. 4.22, with "synch" indicating synchronizing tokens obtained from the FOLLOW set of the nonterminal in question. The FOLLOW sets for the nonterminals are obtained from Example 4.30.
The table in Fig. 4.22 is to be used as follows. If the parser looks up
entry M[A,a] and finds that it is
blank, then the input symbol a is
skipped. If the entry is
"synch," then the nonterminal on top of the stack is popped in an
attempt to resume parsing. If a token on top of the stack does not match the
input symbol, then we pop the token from the stack, as mentioned above.
On the erroneous input ) id * + id, the parser and error recovery
mechanism of Fig. 4.22 behave as in Fig. 4.23.
The above discussion of panic-mode recovery does not address the important issue of error messages. The compiler designer must supply informative error messages that not only describe the error but also draw attention to where the error was discovered.
Phrase-level Recovery
Phrase-level error recovery is implemented by filling in the blank
entries in the predictive parsing table
with pointers to error routines. These routines may change, insert, or delete
symbols on the input and issue appropriate error messages. They may also pop
from the stack. Alteration of stack symbols or the pushing of new symbols onto
the stack is questionable for several reasons. First, the steps carried out by
the parser might then not correspond to the derivation of any word in the
language at all. Second, we must ensure that there is no possibility of an
infinite loop. Checking that any recovery action eventually results in an input
symbol being consumed (or the stack being shortened if the end of the input has
been reached) is a good way to protect against such loops.
6. Exercises for Section 4.4
Exercise 4.4.1: For each of the following grammars, devise predictive parsers and show the parsing tables. You may left-factor and/or eliminate left-recursion from your grammars first.
a) The grammar of Exercise 4.2.2(a).
b) The grammar of Exercise 4.2.2(b).
c) The grammar of Exercise 4.2.2(c).
d) The grammar of Exercise 4.2.2(d).
e) The grammar of Exercise 4.2.2(e).
f) The grammar of Exercise 4.2.2(g).
Exercise 4.4.2: Is it possible, by modifying the grammar in any way, to construct a predictive parser for the language of Exercise 4.2.1 (postfix expressions with operand a)?
Exercise 4.4.3: Compute FIRST and FOLLOW for the grammar of Exercise 4.2.1.
Exercise 4.4.4: Compute FIRST and FOLLOW for each of the grammars of Exercise 4.2.2.
Exercise 4.4.5: The grammar S → a S a | a a generates all even-length strings of a's. We can devise a recursive-descent parser with backtrack for this grammar. If we choose to expand by production S → a a first, then we shall only recognize the string aa. Thus, any reasonable recursive-descent parser will try S → a S a first.

a) Show that this recursive-descent parser recognizes inputs aa, aaaa, and aaaaaaaa, but not aaaaaa.
!! b) What language does this recursive-descent parser recognize?
The following exercises are useful steps in the construction of a "Chomsky Normal Form" grammar from arbitrary grammars, as defined in Exercise 4.4.8.
! Exercise 4.4.6: A grammar is ε-free if no production body is ε (called an ε-production).

a) Give an algorithm to convert any grammar into an ε-free grammar that generates the same language (with the possible exception of the empty string; no ε-free grammar can generate ε).

b) Apply your algorithm to the grammar S → a S b S | b S a S | ε. Hint: First find all the nonterminals that are nullable, meaning that they generate ε, perhaps by a long derivation.
! Exercise 4.4.7: A single production is a production whose body is a single nonterminal, i.e., a production of the form A → B.

a) Give an algorithm to convert any grammar into an ε-free grammar, with no single productions, that generates the same language (with the possible exception of the empty string). Hint: First eliminate ε-productions, and then find for which pairs of nonterminals A and B does A ⇒* B by a sequence of single productions.

b) Apply your algorithm to the grammar (4.1) in Section 4.1.2.

c) Show that, as a consequence of part (a), we can convert a grammar into an equivalent grammar that has no cycles (derivations of one or more steps in which A ⇒+ A for some nonterminal A).
Exercise 4.4.8: A grammar is said to be in Chomsky Normal Form (CNF) if every production is either of the form A → B C or of the form A → a, where A, B, and C are nonterminals, and a is a terminal. Show how to convert any grammar into a CNF grammar for the same language (with the possible exception of the empty string; no CNF grammar can generate ε).
Exercise 4.4.9: Every language that has a context-free grammar can be recognized in at most O(n³) time for strings of length n. A simple way to do so, called the Cocke-Younger-Kasami (or CYK) algorithm, is based on dynamic programming. That is, given a string a1 a2 ··· an, we construct an n-by-n table T such that Tij is the set of nonterminals that generate the substring ai ai+1 ··· aj. If the underlying grammar is in CNF (see Exercise 4.4.8), then one table entry can be filled in in O(n) time, provided we fill the entries in the proper order: lowest value of j − i first. Write an algorithm that correctly fills in the entries of the table, and show that your algorithm takes O(n³) time. Having filled in the table, how do you determine whether a1 a2 ··· an is in the language?
Exercise 4.4.10: Show how, having filled in the table as in Exercise 4.4.9, we can in O(n) time recover a parse tree for a1 a2 ··· an. Hint: modify the table so it records, for each nonterminal A in each table entry Tij, some pair of nonterminals in other table entries that justified putting A in Tij.
Exercise 4.4.11: Modify your algorithm of Exercise 4.4.9 so that it will find, for any string, the smallest number of insert, delete, and mutate errors (each error a single character) needed to turn the string into a string in the language of the underlying grammar.
Exercise 4.4.12: In Fig. 4.24 is a grammar for certain statements. You may take e and s to be terminals standing for conditional expressions and "other statements," respectively. If we resolve the conflict regarding expansion of the optional "else" (nonterminal stmtTail) by preferring to consume an else from the input whenever we see one, we can build a predictive parser for this grammar. Using the idea of synchronizing symbols described in Section 4.4.5:

a) Build an error-correcting predictive parsing table for the grammar.

b) Show the behavior of your parser on the following inputs:

(i) if e then s ; if e then s end
(ii) while e do begin s ; if e then s ; end