Input Buffering
Before discussing the problem of recognizing lexemes in the input, let
us examine some ways that the simple but important task of reading the source
program can be speeded. This task is made difficult by the fact that we often
have to look one or more characters beyond the next lexeme before we can be
sure we have the right lexeme. The box on "Tricky Problems When
Recognizing Tokens" in Section 3.1 gave an extreme example, but there are
many situations where we need to look at least one additional character ahead.
For instance, we cannot be sure we've seen the end of an identifier until we
see a character that is not a letter or digit, and therefore is not part of the
lexeme for id. In C, single-character operators like -, =, or < could also
be the beginning of a two-character operator like ->, ==, or <=. Thus, we
shall introduce a two-buffer scheme that handles large lookaheads safely. We
then consider an improvement involving "sentinels" that saves time
checking for the ends of buffers.
1. Buffer Pairs
Because of the amount of time taken to process characters and the large
number of characters that must be processed during the compilation of a large
source program, specialized buffering techniques have been developed to reduce
the amount of overhead required to process a single input character. An
impor-tant scheme involves two buffers that are alternately reloaded, as
suggested in Fig. 3.3.
Each buffer is of the same size N,
and N is usually the size of a disk
block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than
using one system call per character. If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of
the source file and is different from any possible character of the source
program.
Two pointers to the input are maintained:
Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found; the exact
strategy whereby this determination is made will be covered in the balance of
this chapter.
Once the next lexeme is determined, forward
is set to the character at its right end. Then, after the lexeme is recorded as
an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme
just found. In Fig. 3.3, we see forward has passed the end of the next
lexeme, ** (the Fortran exponentiation operator), and must be retracted one
position to its left.
Advancing forward requires
that we first test whether we have reached the end of one of the buffers, and
if so, we must reload the other buffer from the input, and move forward to the beginning of the newly
loaded buffer. As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance we look ahead is
greater than N, we shall never
overwrite the lexeme in its buffer before determining it.
2. Sentinels
If we use the scheme of Section 3.2.1
as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do,
then we must also reload the other buffer. Thus, for each character read, we
make two tests: one for the end of the buffer, and one to determine what
character is read (the latter may be a multiway branch). We can combine the
buffer-end test with the test for the current character if we extend each
buffer to hold a sentinel character
at the end. The
sentinel is a special character that
cannot be part of the source program, and a natural choice is the
character eof.
Figure 3.4 shows the same
arrangement as Fig. 3.3, but with
the sentinels added. Note that eof
retains its use as a marker for the end of the entire input. Any eof that appears other than at the end
of a buffer means that the input is at an end. Figure 3.5 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway
branch based on the character pointed to by forward, is the only test we make, except in the case where we
actually are at the end of a buffer or the end of the input.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.