Chapter: Compilers : Principles, Techniques, & Tools : Introduction

Programming Language Basics

1 The Static/Dynamic Distinction 2 Environments and States 3 Static Scope and Block Structure 4 Explicit Access Control 5 Dynamic Scope 6 Parameter Passing Mechanisms 7 Aliasing 8 Exercises for Section

Programming Language Basics

1 The Static/Dynamic Distinction

2 Environments and States

3 Static Scope and Block Structure

4 Explicit Access Control

5 Dynamic Scope

6 Parameter Passing Mechanisms

7 Aliasing

8 Exercises

In this section, we shall cover the most important terminology and distinctions that appear in the study of programming languages. It is not our purpose to cover all concepts or all the popular programming languages. We assume that the reader is familiar with at least one of C, C + + , C # , or Java, and may have encountered other languages as well.

The Static/Dynamic Distinction

Among the most important issues that we face when designing a compiler for a language is what decisions can the compiler make about a program. If a language uses a policy that allows the compiler to decide an issue, then we say that the language uses a static policy or that the issue can be decided at compile time. On the other hand, a policy that only allows a decision to be made when

we execute the program is said to be a dynamic policy or to require a decision

at run time.

One issue on which we shall concentrate is the scope of declarations. The scope of a declaration of x is the region of the program in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope. With dynamic scope, as the program runs, the same use of x could refer to any of several different declarations of x.

Most languages, such as C and Java, use static scope. We shall discuss static

scoping in Section 1.6.3.

E x a m p l e 1.3 : As another example of the static/dynamic distinction, consider

the use of the term "static" as it applies to data in a Java class declaration. In Java, a variable is a name for a location in memory used to hold a data value. Here, "static" refers not to the scope of the variable, but rather to the ability of the compiler to determine the location in memory where the declared variable can be found. A declaration like

public static int x;

makes x a class variable and says that there is only one copy of x, no matter how many objects of this class are created. Moreover, the compiler can determine a location in memory where this integer x will be held. In contrast, had "static" been omitted from this declaration, then each object of the class would have its own location where x would be held, and the compiler could not determine all these places in advance of running the program. •

Environments and States

Another important distinction we must make when discussing programming languages is whether changes occurring as the program runs affect the values of data elements or affect the interpretation of names for that data. For example, the execution of an assignment such as x = y +1 changes the value denoted by the name x. More specifically, the assignment changes the value in whatever location is denoted by x.

It may be less clear that the location denoted by x can change at run time.

For instance, as we discussed in Example 1.3, if x is not a static (or "class") variable, then every object of the class has its own location for an instance of variable x. In that case, the assignment to x can change any of those "instance" variables, depending on the object to which a method containing that assignment is applied.

The association of names with locations in memory (the store) and then with values can be described by two mappings that change as the program runs (see Fig. 1.8):

The environment is a mapping from names to locations in the store. Since variables refer to locations ("1-values" in the terminology of C), we could alternatively define an environment as a mapping from names to variables.

The state is a mapping from locations in store to their values. That is, the state maps 1-values to their corresponding r-values, in the terminology of C.

Environments change according to the scope rules of a language.

E x a m p l e 1.4: Consider the C program fragment in Fig. 1.9. Integer i is declared a global variable, and also declared as a variable local to function /. When / is executing, the environment adjusts so that name i refers to the location reserved for the i that is local to /, and any use of i, such as the assignment i = 3 shown explicitly, refers to that location. Typically, the local i is given a place on the run-time stack.

Whenever a function g other than / is executing, uses of i cannot refer to the i that is local to /. Uses of name i in g must be within the scope of some other declaration of i. An example is the explicitly shown statement x = i+1, which is inside some procedure whose definition is not shown. The i in i + 1 presumably refers to the global i. As in most languages, declarations in C must precede their use, so a function that comes before the global i cannot refer to it. •

The environment and state mappings in Fig. 1.8 are dynamic, but there are a few exceptions:

1. Static versus dynamic binding of names to locations. Most binding of names to locations is dynamic, and we discuss several approaches to this binding throughout the section. Some declarations, such as the global i in Fig. 1.9, can be given a location in the store once and for all, as the compiler generates object code.²

Static versus dynamic binding of locations to values. The binding of lo-cations to values (the second stage in Fig. 1.8), is generally dynamic as well, since we cannot tell the value in a location until we run the program. Declared constants are an exception. For instance, the C definition

#define ARRAYSIZE 1000

Names, Identifiers, and Variables

Although the terms "name" and "variable," often refer to the same thing, we use them carefully to distinguish between compile-time names and the run-time locations denoted by names.

An identifier is a string of characters, typically letters or digits, that refers to (identifies) an entity, such as a data object, a procedure, a class, or a type. All identifiers are names, but not all names are identifiers. Names can also be expressions. For example, the name x.y might denote the field y of a structure denoted by x. Here, x and y are identifiers, while x.y is a name, but not an identifier. Composite names like x.y are called qualified names.

A variable refers to a particular location of the store. It is common for the same identifier to be declared more than once; each such declaration introduces a new variable. Even if each identifier is declared just once, an identifier local to a recursive procedure will refer to different locations of the store at different times.

binds the name ARRAYS IZE to the value 1000 statically. We can determine this binding by looking at the statement, and we know that it is impossible for this binding to change when the program executes.

Static Scope and Block Structure

Most languages, including C and its family, use static scope. The scope rules for C are based on program structure; the scope of a declaration is determined implicitly by where the declaration appears in the program. Later languages, such as C + + , Java, and C # , also provide explicit control over scopes through the use of keywords like public, private, and protected .

In this section we consider static-scope rules for a language with blocks, where a block is a grouping of declarations and statements. C uses braces { and } to delimit a block; the alternative use of begin and end for the same purpose dates back to Algol.

E x a m p l e 1.5 : To a first approximation, the C static-scope policy is as follows:

A C program consists of a sequence of top-level declarations of variables and functions.

Functions may have variable declarations within them, where variables include local variables and parameters. The scope of each such declaration is restricted to the function in which it appears.

Procedures, Functions, and Methods

To avoid saying "procedures, functions, or methods," each time we want to talk about a subprogram that may be called, we shall usually refer to all of them as "procedures." The exception is that when talking explicitly of programs in languages like C that have only functions, we shall refer to them as "functions." Or, if we are discussing a language like Java that has only methods, we shall use that term instead.

A function generally returns a value of some type (the "return type"), while a procedure does not return any value. C and similar languages, which have only functions, treat procedures as functions that have a special return type "void," to signify no return value. Object-oriented languages like Java and C + + use the term "methods." These can behave like either functions or procedures, but are associated with a particular class.

3. The scope of a top-level declaration of a name x consists of the entire program that follows, with the exception of those statements that lie within a function that also has a declaration of x.

The additional detail regarding the C static-scope policy deals with variable declarations within statements. We examine such declarations next and in Example 1.6. •

In C, the syntax of blocks is given by

One type of statement is a block. Blocks can appear anywhere that other types of statements, such as assignment statements, can appear.

A block is a sequence of declarations followed by a sequence of statements, all surrounded by braces.

Note that this syntax allows blocks to be nested inside each other. This nesting property is referred to as block structure. The C family of languages has block structure, except that a function may not be defined inside another function.

We say that a declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B.

The static-scope rule for variable declarations in a block-structured lan-guages is as follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within J5, in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.

An equivalent way to express this rule is to focus on a use of a name x. Let Bi, i?2, • • • , Bk be all the blocks that surround this use of x, with Bk the smallest, nested within Bk-i, which is nested within Bk-2, and so on. Search for the largest i such that there is a declaration of x belonging to B^. This use of x refers to the declaration in B{. Alternatively, this use of x is within the scope of the declaration in Bi.

E x a m p l e 1.6: The C + + program in Fig. 1.10 has four blocks, with several definitions of variables a and b. As a memory aid, each declaration initializes its variable to the number of the block to which it belongs.

For instance, consider the declaration i n t a = 1 in block Bi. Its scope is all of Bi, except for those blocks nested (perhaps deeply) within B± that have their own declaration of a. Z?2, nested immediately within B±, does not have a declaration of a, but B3 does. B4 does not have a declaration of a, so block B3 is the only place in the entire program that is outside the scope of the declaration of the name a that belongs to B\. That is, this scope includes B4 and all of Z?2 except for the part of B2 that is within B3. The scopes of all five declarations are summarized in Fig. 1.11.

From another point of view, let us consider the output statement in block B4 and bind the variables a and b used there to the proper declarations. The list of surrounding blocks, in order of increasing size, is B4,B2,Bi. Note that #3 does not surround the point in question. B4 has a declaration of b, so it is to this declaration that this use of b refers, and the value of 6 printed is 4. However, B4 does not have a declaration of o, so we next look at £?2- That block does not have a declaration of a either, so we proceed to B\. Fortunately, there is a declaration i n t a = 1 belonging to that block, so the value of a printed is 1. Had there been no such declaration, the program would have been erroneous.

Explicit Access Control

Classes and structures introduce a new scope for their members. If p is an object of a class with a field (member) x, then the use of x in p.x refers to field x in the class definition. In analogy with block structure, the scope of a member declaration x in a class C extends to any subclass C, except if C has a local declaration of the same name x.

Through the use of keywords like public, private, and protected, object-oriented languages such as C + + or Java provide explicit control over access to member names in a superclass. These keywords support encapsulation by restricting access. Thus, private names are purposely given a scope that includes only the method declarations and definitions associated with that class and any "friend" classes (the C + + term). Protected names are accessible to subclasses. Public names are accessible from outside the class.

In C + + , a class definition may be separated from the definitions of some or all of its methods. Therefore, a name x associated with the class C may have a region of the code that is outside its scope, followed by another region (a method definition) that is within its scope. In fact, regions inside and outside the scope may alternate, until all the methods have been defined.

Dynamic Scope

Technically, any scoping policy is dynamic if it is based on factor(s) that can be known only when the program executes. The term dynamic scope, however, usually refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration. Dynamic scoping of this type appears only in special situations. We shall consider two ex-amples of dynamic policies: macro expansion in the C preprocessor and method resolution in object-oriented programming.

Declarations and Definitions

The apparently similar terms "declaration" and "definition" for programming-language concepts are actually quite different. Declarations tell us about the types of things, while definitions tell us about their values. Thus, i n t i is a declaration of i, while i = 1 is a definition of i.

The difference is more significant when we deal with methods or other procedures. In C + + , a method is declared in a class definition, by giving the types of the arguments and result of the method (often called the signature for the method. The method is then defined, i.e., the code for executing the method is given, in another place. Similarly, it is common to define a C function in one file and declare it in other files where the function is used.

Example 1.7: In the C program of Fig. 1.12, identifier a is a macro that stands for expression (a? + 1). But what is xl We cannot resolve x statically, that is, in terms of the program text.

#define a (x+1)

i n t x = 2;

void b ( ) { i n t x = 1; p r i n t f ("0 /.d\n", a ) ; }

void c() { printf (" , / . d\n", a ) ; }

void mainO { b ( ) ; c ( ) ; }

Figure 1.12: A macro whose names must be scoped dynamically

In fact, in order to interpret x, we must use the usual dynamic-scope rule.

We examine all the function calls that are currently active, and we take the most recently called function that has a declaration of It is to this declaration that the use of x refers.

In the example of Fig. 1.12, the function main first calls function 6. As b executes, it prints the value of the macro a. Since (x + 1) must be substituted for a, we resolve this use of x to the declaration i n t x=l in function b. The reason is that b has a declaration of x, so the (x + 1) in the p r i n t f in b refers to this x. Thus, the value printed is 1.

After b finishes, and c is called, we again need to print the value of macro a. However, the only x accessible to c is the global x. The p r i n t f statement in c thus refers to this declaration of x, and value 2 is printed.

Dynamic scope resolution is also essential for polymorphic procedures, those that have two or more definitions for the same name, depending only on the types of the arguments. In some languages, such as ML (see Section 7.3.3), it is possible to determine statically types for all uses of names, in which case the compiler can replace each use of a procedure name p by a reference to the code for the proper procedure. However, in other languages, such as Java and C + + , there are times when the compiler cannot make that determination.

Analogy Between Static and Dynamic Scoping

While there could be any number of static or dynamic policies for scoping, there is an interesting relationship between the normal (block-structured) static scoping rule and the normal dynamic policy. In a sense, the dynamic rule is to time as the static rule is to space. While the static rule asks us to find the declaration whose unit (block) most closely surrounds the physical location of the use, the dynamic rule asks us to find the declaration whose unit (procedure invocation) most closely surrounds the time of the use.

Example 1 . 8 : A distinguishing feature of object-oriented programming is the ability of each object to invoke the appropriate method in response to a message. In other words, the procedure called when x.mQ is executed depends on the class of the object denoted by x at that time. A typical example is as follows:

1. There is a class C with a method named m() .

2. D is a subclass of C, and D has its own method named ra().

3. There is a use of m of the form x.mQ, where x is an object of class C.

Normally, it is impossible to tell at compile time whether x will be of class C or of the subclass D. If the method application occurs several times, it is highly likely that some will be on objects denoted by x that are in class C but not D, while others will be in class D. It is not until run-time that it can be decided which definition of m is the right one. Thus, the code generated by the compiler must determine the class of the object x, and call one or the other method named m. •

Parameter Passing Mechanisms

All programming languages have a notion of a procedure, but they can differ in how these procedures get their arguments. In this section, we shall consider how the actual parameters (the parameters used in the call of a procedure) are associated with the formal parameters (those used in the procedure defi-nition). Which mechanism is used determines how the calling-sequence code treats parameters. The great majority of languages use either "call-by-value," or "call-by-reference," or both. We shall explain these terms, and another method known as "call-by-name," that is primarily of historical interest.

Call - by - Value

In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable). The value is placed in the location belonging to the corresponding formal parameter of the called procedure. This method is used in C and Java, and is a common option in C + + , as well as in most other languages. Call-by-value has the effect that all computation involving the formal parameters done by the called procedure is local to that procedure, and the actual parameters themselves cannot be changed.

Note, however, that in C we can pass a pointer to a variable to allow that variable to be changed by the callee. Likewise, array names passed as param eters in C, C + + , or Java give the called procedure what is in effect a pointer or reference to the array itself. Thus, if a is the name of an array of the calling procedure, and it is passed by value to corresponding formal parameter x, then an assignment such as x [ i ] = 2 really changes the array element a[2]. The reason is that, although x gets a copy of the value of a, that value is really a pointer to the beginning of the area of the store where the array named a is located.

Similarly, in Java, many variables are really references, or pointers, to the things they stand for. This observation applies to arrays, strings, and objects of all classes. Even though Java uses call-by-value exclusively, whenever we pass the name of an object to a called procedure, the value received by that procedure is in effect a pointer to the object. Thus, the called procedure is able to affect the value of the object itself.

Call - by - Reference

In call-by-reference, the address of the actual parameter is passed to the callee as the value of the corresponding formal parameter. Uses of the formal parameter in the code of the callee are implemented by following this pointer to the location indicated by the caller. Changes to the formal parameter thus appear as changes to the actual parameter.

If the actual parameter is an expression, however, then the expression is evaluated before the call, and its value stored in a location of its own. Changes to the formal parameter change this location, but can have no effect on the data of the caller.

Call-by-reference is used for "ref" parameters in C + + and is an option in many other languages. It is almost essential when the formal parameter is a large object, array, or structure. The reason is that strict call-by-value requires that the caller copy the entire actual parameter into the space belonging to the corresponding formal parameter. This copying gets expensive when the parameter is large. As we noted when discussing call-by-value, languages such as Java solve the problem of passing arrays, strings, or other objects by copying only a reference to those objects. The effect is that Java behaves as if it used call-by-reference for anything other than a basic type such as an integer or real.

Call - by - Name

A third mechanism — call-by-name — was used in the early programming language Algol 60. It requires that the callee execute as if the actual parameter were substituted literally for the formal parameter in the code of the callee, as if the formal parameter were a macro standing for the actual parameter (with renaming of local names in the called procedure, to keep them distinct). When the actual parameter is an expression rather than a variable, some unintuitive behaviors occur, which is one reason this mechanism is not favored today.

Aliasing

There is an interesting consequence of call-by-reference parameter passing or its simulation, as in Java, where references to objects are passed by value. It is possible that two formal parameters can refer to the same location; such variables are said to be aliases of one another. As a result, any two variables, which may appear to take their values from two distinct formal parameters, can become aliases of each other, as well.

Example 1. 9 : Suppose a is an array belonging to a procedure p, and p calls another procedure q(x,y) with a call q(a,a). Suppose also that parameters are passed by value, but that array names are really references to the location where the array is stored, as in C or similar languages. Now, x and y have become aliases of each other. The important point is that if within q there is an assignment x[10] = 2, then the value of y[10] also becomes 2.

It turns out that understanding aliasing and the mechanisms that create it is essential if a compiler is to optimize a program. As we shall see starting in Chapter 9, there are many situations where we can only optimize code if we can be sure certain variables are not aliased. For instance, we might determine that x = 2 is the only place that variable x is ever assigned. If so, then we can replace a use of x by a use of 2; for example, replace a = x+3 by the simpler a = 5. But suppose there were another variable y that was aliased to x. Then an assignment y = 4 might have the unexpected effect of changing x. It might also mean that replacing a = x+3 by a = 5 was a mistake; the proper value of could be 7 there.

Exercises for Section 1.6

Exercise 1 . 6 . 1 : For the block-structured C code of Fig. 1.13(a), indicate the values assigned to w, x, y, and z.

Exercise 1 . 6 . 2: Repeat Exercise 1.6.1 for the code of Fig. 1.13(b).

Exercise 1 . 6 . 3 : For the block-structured code of Fig. 1.14, assuming the usual static scoping of declarations, give the scope for each of the twelve declarations.

Summary of Chapter 1

Language Processors. An integrated software development environment includes many different kinds of language processors such as compilers, interpreters, assemblers, linkers, loaders, debuggers, profilers.

Compiler Phases. A compiler operates as a sequence of phases, each of which transforms the source program from one intermediate representa-tion to another.

4> Machine and Assembly Languages. Machine languages were the first-generation programming languages, followed by assembly languages. Pro - gramming in these languages was time consuming and error prone.

Modeling in Compiler Design. Compiler design is one of the places where theory has had the most impact on practice. Models that have been found useful include automata, grammars, regular expressions, trees, and many others.

4 Code Optimization. Although code cannot truly be "optimized," the sci-ence of improving the efficiency of code is both complex and very impor-tant. It is a major portion of the study of compilation.

4 Higher-Level Languages. As time goes on, programming languages take on progressively more of the tasks that formerly were left to the program-mer, such as memory management, type-consistency checking, or parallel execution of code.

4 Compilers and Computer Architecture. Compiler technology influences computer architecture, as well as being influenced by the advances in architecture. Many modern innovations in architecture depend on compilers being able to extract from source programs the opportunities to use the hardware capabilities effectively.

4 Software Productivity and Software Security. The same technology that allows compilers to optimize code can be used for a variety of program-analysis tasks, ranging from detecting common program bugs to discovering that a program is vulnerable to one of the many kinds of intrusions that "hackers" have discovered.

4 Scope Rules. The scope of a declaration of x is the context in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope.

Environments. The association of names with locations in memory and then with values can be described in terms of environments, which map names to locations in store, and states, which map locations to their values.

4 Block Structure. Languages that allow blocks to be nested are said to have block structure. A name a: in a nested block B is in the scope of a declaration D of x in an enclosing block if there is no other declaration of x in an intervening block.

4 Parameter Passing. Parameters are passed from a calling procedure to the callee either by value or by reference. When large objects are passed by value, the values passed are really references to the objects themselves, resulting in an effective call-by-reference.

Aliasing. When parameters are (effectively) passed by reference, two for-mal parameters can refer to the same object. This possibility allows a change in one variable to change another.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Compilers : Principles, Techniques, & Tools : Introduction : Programming Language Basics |