Home | | Multi - Core Architectures and Programming | The parallel For Directive

# The parallel For Directive

As an alternative to our explicit parallelization of the trapezoidal rule, OpenMP pro-vides the parallel for directive.

THE parallel for DIRECTIVE

As an alternative to our explicit parallelization of the trapezoidal rule, OpenMP pro-vides the parallel for directive. Using it, we can parallelize the serial trapezoidal rule

h = (b-a)/n;

approx = (f(a) + f(b))/2.0;

for (i = 1; i <= n-1; i++)

approx += f(a + i *h);

approx = h*approx;

by simply placing a directive immediately before the for loop:

h = (b-a)/n;

approx = (f(a) + f(b))/2.0;

# pragma omp parallel

reduction(+: approx)

for (i = 1; i <= n-1; i++)

approx += f(a + i*h);

approx = h*approx;

Like the parallel directive, the parallel for directive forks a team of threads to execute the following structured block. However, the structured block following the parallel for directive must be a for loop. Furthermore, with the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop among the threads. The parallel for directive is therefore very different from the parallel directive, because in a block that is preceded by a parallel directive, in general, the work must be divided among the threads by the threads themselves.

In a for loop that has been parallelized with a parallel for directive, the default partitioning, that is, of the iterations among the threads is up to the sys-tem. However, most systems use roughly a block partitioning, that is, if there are m iterations, then roughly the first m/thread count are assigned to thread 0, the next m/thread count are assigned to thread 1, and so on.

Note that it was essential that we made approx a reduction variable. If we hadn’t, it would have been an ordinary shared variable, and the body of the loop

approx += f(a + i*h);

would be an unprotected critical section.

However, speaking of scope, the default scope for all variables in a parallel directive is shared, but in our parallel for if the loop variable i were shared, the variable update, i++, would also be an unprotected critical section. Hence, in a loop that is parallelized with a parallel for directive, the default scope of the loop variable is private; in our code, each thread in the team has its own copy of i.

1. Caveats

This is truly wonderful: It may be possible to parallelize a serial program that consists of one large for loop by just adding a single parallel for directive. It may be possible to incrementally parallelize a serial program that has many for loops by successively placing parallel for directives before each loop.

However, things may not be quite as rosy as they seem. There are several caveats associated with the use of the parallel for directive. First, OpenMP will only par-allelize for loops. It won’t parallelize while loops or do while loops. This may not seem to be too much of a limitation, since any code that uses a while loop or a do while loop can be converted to equivalent code that uses a for loop instead.

However, OpenMP will only parallelize for loops for which the number of iterations can be determined

. from the for statement itself (that is, the code for (. . . ; . . . ; . . .)),

. andprior to execution of the loop.

For example, the “infinite loop”

for ( ; ; ) {

. . .

}

cannot be parallelized. Similarly, the loop

for (i = 0; i < n; i++) {

if ( . . . ) break;

. . .

}

cannot be parallelized, since the number of iterations can’t be determined from the for statement alone. This for loop is also not a structured block, since the break adds another point of exit from the loop.

In fact, OpenMP will only parallelize for loops that are in canonical form. Loops in canonical form take one of the forms shown in Program 5.3. The variables and expressions in this template are subject to some fairly obvious restrictions:

.. The variable index must have integer or pointer type (e.g., it can’t be a float). The expressions start, end, and incr must have a compatible type. For example, if index is a pointer, then incr must have integer type.

The expressions start, end, and incr must not change during execution of the loop.

During. execution of the loop, the variable index can only be modified by the “increment expression” in the for statement. These restrictions allow the run-time system to determine the number of iterations prior to execution of the loop.

The sole exception to the rule that the run-time system must be able to determine the number of iterations prior to execution is that there can be a call to exit in the body of the loop.

2. Data dependences

If a for loop fails to satisfy one of the rules outlined in the preceding section, the compiler will simply reject it. For example, suppose we try to compile a program with the following linear search function:

int Linear search(int key, int A[], int n) {

int i;

for (i = 0; i < n; i++)

if (A[i] == key) return i;

return -1;  /*  key not in list  */

}

The gcc compiler reports:

Line 6: error: invalid exit from OpenMP structured block

A more insidious problem occurs in loops in which the computation in one iter-ation depends on the results of one or more previous iterations. As an example, consider the following code, which computes the first n fibonacci numbers:

fibo = fibo = 1;

for (i = 2; i < n; i++)

fibo[i] = fibo[i-1] + fibo[i-2];

Although we may be suspicious that something isn’t quite right, let’s try parallellizing the for loop with a parallel for directive:

fibo = fibo = 1;

for (i = 2; i < n; i++)

fibo[i] = fibo[i-1] + fibo[i-2];

The compiler will create an executable without complaint. However, if we try running it with more than one thread, we may find that the results are, at best, unpredictable. For example, on one of our systems if we try using two threads to compute the first 10 Fibonacci numbers, we sometimes get

1 1 2 3 5 8 13 21 34 55,

which is correct. However, we also occasionally get

1 1 2 3 5 8 0 0 0 0.

What happened? It appears that the run-time system assigned the computation of fibo, fibo, fibo, and fibo to one thread, while fibo, fibo, fibo, and fibo were assigned to the other. (Remember the loop starts with i = 2.) In some runs of the program, everything is fine because the thread that was assigned fibo, fibo, fibo, and fibo finishes its computations before the other thread starts. However, in other runs, the first thread has evidently not computed fibo and fibo when the second computes fibo. It appears that the system has initialized the entries in fibo to 0, and the second thread is using the values fibo = 0 and fibo = 0 to compute fibo. It then goes on to use fibo = 0 and fibo = 0 to compute fibo, and so on.

We see two important points here:

OpenMP compilers don’t check for dependences among iterations in a loop that’s being parallelized with a parallel for directive. It’s up to us, the programmers, to identify these dependences.

A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.

The dependence of the computation of fibo on the computation of fibo is called a data dependence. Since the value of fibo is calculated in one iteration, and the result is used in a subsequent iteration, the dependence is sometimes called a loop-carried dependence.

3. Finding loop-carried dependences

Perhaps the first thing to observe is that when we’re attempting to use a parallel for directive, we only need to worry about loop-carried dependences. We don’t need to worry about more general data dependences. For example, in the loop

for (i = 0; i < n; i++) {

x[i] = a + i*h;

y[i] = exp(x[i]);

}

there is a data dependence between Lines 2 and 3. However, there is no problem with the parallelization

for (I = 0; i < n; i++)      {

x[i] = a + i h;

y[i] = exp(x[i]);

}

since the computation of x[i] and its subsequent use will always be assigned to the same thread.

Also observe that at least one of the statements must write or update the variable in order for the statements to represent a dependence, so in order to detect a loop-carried dependence, we should only concern ourselves with variables that are updated by the loop body. That is, we should look for variables that are read or written in one iteration, and written in another. Let’s look at a couple of examples.

4. Estimating

One way to get a numerical approximation to     is to use many terms in the formula We can implement this formula in serial code with

double factor = 1.0;

double sum = 0.0;

for (k = 0; k < n; k++) {

sum += factor/(2* k+1);

factor =  factor;

}

pi_approx = 4.0* sum;

(Why is it important that factor is a double instead of an int or a long?)

How can we parallelize this with OpenMP? We might at first be inclined to do something like this:

1          double factor = 1.0;

double sum = 0.0;

reduction(+:sum)

for (k = 0; k < n; k++) {

sum += factor/(2*k+1);

factor =  factor;

}

pi_approx = 4.0 sum;

However, it’s pretty clear that the update to factor in Line 7 in iteration k and the subsequent increment of sum in Line 6 in iteration k+1 is an instance of a loop-carried dependence. If iteration k is assigned to one thread and iteration k+1 is assigned to another thread, there’s no guarantee that the value of factor in Line 6 will be correct. In this case we can fix the problem by examining the series We see that in iteration k the value of factor should be (- 1)k, which is +1 if k is even and -1 if k is odd, so if we replace the code we will eliminate the loop dependency.

However, things still aren’t quite right. If we run the program on one of our systems with just two threads and n = 1000, the result is consistently wrong. For example,

With n = 1000 terms and 2 threads,

2                   Our estimate of pi = 2.97063289263385

With n = 1000 terms and 2 threads,

Our estimate of pi = 3.22392164798593

On the other hand, if we run the program with only one thread, we always get

With n = 1000 terms and 1 threads,

Our estimate of pi = 3.14059265383979

What’s wrong here?

Recall that in a block that has been parallelized by a parallel for directive, by default any variable declared before the loop—with the sole exception of the loop variable—is shared among the threads. So factor is shared and, for exam-ple, thread 0 might assign it the value 1, but before it can use this value in the update to sum, thread 1 could assign it the value 1. Therefore, in addition to eliminating the loop-carried dependence in the calculation of factor, we need to insure that each thread has its own copy of factor. That is, in order to make our code correct, we need to also insure that factor has private scope. We can do this by adding a private clause to the parallel for directive.

double sum = 0.0;

reduction(+:sum)  private(factor)

for (k = 0; k < n; k++) f

if (k % 2 == 0)

factor = 1.0;

else

factor =  1.0;

sum += factor/(2 k+1);

g

The private clause specifies that for each variable listed inside the parentheses, a private copy is to be created for each thread. Thus, in our example, each of the thread count threads will have its own copy of the variable factor, and hence the updates of one thread to factor won’t affect the value of factor in another thread.

It’s important to remember that the value of a variable with private scope is unspecified at the beginning of a parallel block or a parallel for block. Its value is also unspecified after completion of a parallel or parallel for block. So, for example, the output of the first printf statement in the following code is unspeci-fied, since it prints the private variable x before it’s explicitly initialized. Similarly, the output of the final printf is unspecified, since it prints x after the completion of the parallel block.

int x = 5;

private(x)

f

printf("Thread %d > before initialization, x = %d\n",

my_rank, x);

x = 2* my_rank + 2;

printf("Thread %d > after initialization, x = %d\n",

my_rank, x);

}

printf("After parallel block, x = %d\n", x);

5. More on scope

Our problem with the variable factor is a common one. We usually need to think about the scope of each variable in a parallel block or a parallel for block. Therefore, rather than letting OpenMP decide on the scope of each variable, it’s a very good practice for us as programmers to specify the scope of each variable in a block. In fact, OpenMP provides a clause that will explicitly require us to do this: the default clause. If we add the clause

default(none)

to our parallel or parallel for directive, then the compiler will require that we specify the scope of each variable we use in the block and that has been declared outside the block. (Variables that are declared within the block are always private, since they are allocated on the thread’s stack.)

For example, using a default(none) clause, our calculation of could be written as follows:

double sum = 0.0;

default(none) reduction(+:sum) private(k, factor) \

shared(n)

for (k = 0; k < n; k++) {

if (k % 2 == 0)

factor = 1.0;

else

factor =       -1.0;

sum += factor/(2* k+1);

}

In this example, we use four variables in the for loop. With the default clause, we need to specify the scope of each. As we’ve already noted, sum is a reduction variable (which has properties of both private and shared scope). We’ve also already noted that factor and the loop variable k should have private scope. Variables that are never updated in the parallel or parallel for block, such as n in this example, can be safely shared. Recall that unlike private variables, shared variables have the same value in the parallel or parallel for block that they had before the block, and their value after the block is the same as their last value in the block. Thus, if n were initialized before the block to 1000, it would retain this value in the parallel for statement, and since the value isn’t changed in the for loop, it would retain this value after the loop has completed.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail
An Introduction to Parallel Programming : Shared-Memory Programming with OpenMP : The parallel For Directive |