Why Interprocedural Analysis?
1 Virtual Method Invocation
2 Pointer Alias Analysis
4 Detection of Software Errors and Vulnerabilities
5 SQL Injection
6 Buffer Overflow
Given how hard interprocedural analysis is, let us now address the important problem of why and when we wish to use interprocedural analysis. Although we used constant propagation to illustrate interprocedural analysis, this inter-procedural optimization is neither readily applicable nor particularly beneficial when it does occur. Most of the benefits of constant propagation can be ob-tained simply by performing intraprocedural analysis and inlining procedure calls of the most frequently executed sections of code.
However, there are many reasons why interprocedural analysis is essential. Below, we describe several important applications of interprocedural analysis.
1. Virtual Method Invocation
As mentioned above, object-oriented programs have many small methods. If we only optimize one method at a time, then there are few opportunities for optimization. Resolving method invocation enables optimization. A language like Java dynamically loads its classes. As a result, we do not know at compile-time to which of (perhaps) many methods named m a use of "m" refers in an invocation such as x.mQ.
Many Java implementations use a just-in-time compiler to compile its byte-codes at run time. One common optimization is to profile the execution and determine which are the common receiver types. We can then inline the meth-ods that are most frequently invoked. The code includes a dynamic check on the type and executes the inlined methods if the run-time object has the expected type.
Another approach to resolving uses of a method name m is possible as long as all the source code is available at compile time. Then, it is possible to perform an interprocedural analysis to determine the object types. If the type for a variable x turns out to be unique, then a use of x.mQ can be resolved.
We know exactly what method m refers to in this context. In that case, we can in-line the code for this m, and the compiler does not even have to include a test for the type of x.
2. Pointer Alias Analysis
Even if we do not wish to perform interprocedural versions of the common data-flow analyses like reaching definitions, these analyses can in fact benefit from interprocedural pointer analysis. All the analyses presented in Chapter 9 apply only to local scalar variables that cannot have aliases. However, use of pointers is common, especially in languages like C. By knowing whether pointers can be aliases (can point to the same location), we can improve the accuracy of the techniques from Chapter 9.
Example 1 2 . 9 : Consider the following sequence of three statements, which might form a basic block:
Without knowing if p and q can point to the same location — that is, whether they can be aliases — we cannot conclude that x is equal to 1 at the end of the block. •
As discussed in Chapter 11, the most effective way to parallelize an application is to find the coarsest granularity of parallelism, such as that found in the outermost loops of a program. For this task, interprocedural analysis is of great importance. There is a significant difference between scalar optimiza-tions (those based on values of simple variables, as discussed in Chapter 9) and parallelization. In parallelization, just one spurious data dependence can render an entire loop not parallelizable, and greatly reduce the effectiveness of the optimization. Such amplification of inaccuracies is not seen in scalar optimizations. In scalar optimization, we only need to find the majority of the optimization opportunities. Missing one opportunity or two seldom makes much of a difference.
4. Detection of Software Errors and Vulnerabilities
Interprocedural analysis is not only important for optimizing code. The same techniques can be used to analyze existing software for many kinds of coding errors. These errors can render software unreliable; coding errors that hackers can exploit to take control of, or otherwise damage, a computer system can pose significant security vulnerability risks.
Static analysis is useful in detecting occurrences of many common error patterns. For example, a data item must be guarded by a lock. As another example, disabling an interrupt in the operating system must be followed by a re-enabling of the interrupt. Since a significant source of errors is the incon-sistencies that span procedure boundaries, interprocedural analysis is of great importance. PREfix and Metal are two practical tools that use interprocedural analysis effectively to find many programming errors in large programs. Such tools find errors statically and can improve software reliability greatly. How-ever, these tools are both incomplete and unsound, in the sense that they may not find all errors, and not all reported warnings are real errors. Unfortunately, the interprocedural analysis used is sufficiently imprecise that, were the tools to report all potential errors, the large number of false warnings would render the tools unusable. Nevertheless, even though these tools are not perfect, their systematic use has been shown to greatly improve software reliability.
When it comes to security vulnerabilities, it is highly desirable that we find all the potential errors in a program. In 2006, two of the "most popular" forms of intrusions used by hackers to compromise a system were
1. Lack of input validation on Web applications: SQL injection is one of the most popular forms of such vulnerability whereby hackers gain control of a database by manipulating inputs accepted by web applications.
2. Buffer overflows in C and C + + programs. Because C and C + + do not check if accesses to arrays are in bounds, hackers can write well-crafted strings into unintended areas and hence gain control of the program's execution.
In the next section, we shall discuss how we can use interprocedural analysis to protect programs against such vulnerabilities.
5. SQL Injection
SQL injection refers to the vulnerability where hackers can manipulate user input to a Web application and gain unintended access to a database. For example, banks want their users to be able to make transactions online, provided they supply their correct password. A common architecture for such a system is to have the user enter strings into a Web form, and then to have those strings form part of a database query written in the SQL language. If systems developers are not careful, the strings provided by the user can alter the meaning of the SQL statement in unexpected ways.
Example 1 2 . 1 0 : Suppose a bank offers its customers access to a relation
AcctData(name, password, balance)
That is, this relation is a table of triples, each consisting of the name of a customer, the password, and the balance of the account. The intent is that cus-tomers can see their account balance only if they provide both their name and their correct password. Having a hacker see an account balance is not the worst thing that could occur, but this simple example is typical of more complicated situations where the hacker could execute payments from the account.
The system might implement a balance inquiry as follows:
1. Users invoke a Web form where they enter their name and password.
2. The name is copied to a variable n and the password to a variable p.
3. Later, perhaps in some other procedure, the following SQL query is exe-cuted:
SELECT b a l a n c e FROM AcctData
WHERE name = 5 : n ' and password = ' : p '
For readers not familiar with SQL, this query says: "Find in the table AcctData a row with the first component (name) equal to the string currently in variable n and the second component (password) equal to the string currently in variable p; print the third component (balance) of that row." Note that SQL uses single
quotes, not double quotes, to delimit strings, and the colons in front of n and
p indicate that they are variables of the surrounding language.
Suppose the hacker, who wants to find Charles Dickens' account balance,
supplies the following values for the strings n and p:
n = Charles Dickens' — p = who cares
The effect of these strange strings is to convert the query into
SELECT balance FROM AcctData
WHERE name = ' Charles Dickens ' —' and password = 'who cares '
In many database systems — is a comment-introducing token and has the effect of making whatever follows on that line a comment. As a result, the query now asks the database system to print the balance for every person whose name is
' C h a r l e s D i c k e n s ' , regardless of the password that appears with that name in a name-password-balance triple. That is, with comments eliminated, the query is:
SELECT b a l a n c e FROM AcctData
WHERE name = ' C h a r l e s Dickens'
In Example 12.10, the "bad" strings were kept in two variables, which might be passed between procedures. However, in more realistic cases, these strings might be copied several times, or combined with others to form the full query. We cannot hope to detect coding errors that create SQL-injection vulnerabilities without doing a full interprocedural analysis of the entire program.
6. Buffer Overflow
A buffer overflow attack occurs when carefully crafted data supplied by the user writes beyond the intended buffer and manipulates the program execution. For example, a C program may read a string s from the user, and then copy it into a buffer b using the function call:
If the string s is actually longer than the buffer b, then locations that are not part of b will have their values changed. That in itself will probably cause the program to malfunction or at least to produce the wrong answer, since some data used by the program will have been changed.
But worse, the hacker who chose the string s can pick a value that will do more than cause an error. For example, if the buffer is on the run-time stack, then it is near the return address for its function. An insidiously chosen value of s may overwrite the return address, and when the function returns, it goes to a place chosen by the hacker. If hackers have detailed knowledge of the surrounding operating system and hardware, they may be able to execute a command that will give them control of the machine itself. In some situations, they may even have the ability to have the false return address transfer control to code that is part of the string s, thus allowing any sort of program to be inserted into the executing code.
To prevent buffer overflows, every array-write operation must be statically proven to be within bounds, or a proper array-bounds check must be performed dynamically. Because these bounds checks need to be inserted by hand in C and C + + programs, it is easy to forget to insert the test or to get the test wrong. Heuristic tools have been developed that will check if at least some test, though not necessarily a correct test, has been performed before a strcpy is called.
Dynamic bounds checking is unavoidable because it is impossible to deter-mine statically the size of users' input. All a static analysis can do is assure that the dynamic checks have been inserted properly. Thus, a reasonable strategy is to have the compiler insert dynamic bounds checking on every write, and use static analysis as a means to optimize away as many bounds check as possible. It is no longer necessary to catch every potential violation; moreover, we only need to optimize only those code regions that execute frequently.
Inserting bounds checking into C programs is nontrivial, even if we do not mind the cost. A pointer may point into the middle of some array, and we do not know the extent of that array. Techniques have been developed to keep track of the extent of the buffer pointed to by each pointer dynamically. This information allows the compiler to insert array bounds checks for all accesses. Interestingly enough, it is not advisable to halt a program whenever a buffer overflow is detected. In fact, buffer overflows do occur in practice, and a pro-gram would likely fail if we disable all buffer overflows. The solution is to extend the size of the array dynamically to accommodate for the buffer overruns.
Interprocedural analysis can be used to speed up the cost of dynamic ar ray bounds checks. For example, suppose we are interested only in catching buffer overflows involving user-input strings, we can use static analysis to de-termine which variables may hold contents provided by the user. Like SQL injection, being able to track an input as it is copied across procedures is useful in eliminating unnecessary bounds checks.