THE parallel for DIRECTIVE
As an alternative to our explicit parallelization of the trapezoidal rule, OpenMP provides the parallel for directive. Using it, we can parallelize the serial trapezoidal rule
h = (b-a)/n;
approx = (f(a) + f(b))/2.0;
for (i = 1; i <= n-1; i++)
   approx += f(a + i*h);
approx = h*approx;
by simply placing a directive immediately before the for loop:

h = (b-a)/n;
approx = (f(a) + f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
for (i = 1; i <= n-1; i++)
   approx += f(a + i*h);
approx = h*approx;
Like the parallel directive, the parallel for directive forks a team of threads to execute the following structured block. However, the structured block following the parallel for directive must be a for loop. Furthermore, with the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop among the threads. The parallel for directive is therefore very different from the parallel directive, because in a block that is preceded by a parallel directive, in general, the work must be divided among the threads by the threads themselves.
In a for loop that has been parallelized with a parallel for directive, the default partitioning of the iterations among the threads is up to the system. However, most systems use roughly a block partitioning: if there are m iterations, then roughly the first m/thread_count are assigned to thread 0, the next m/thread_count are assigned to thread 1, and so on.
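As a sketch of how such a block partitioning might be computed, here is one scheme in plain C. The exact assignment is implementation-defined, and the function and variable names here are ours, not part of OpenMP:

```c
#include <assert.h>

/* Hypothetical sketch of a block partitioning of m iterations among
 * thread_count threads: each thread gets m/thread_count iterations,
 * and the first m % thread_count threads get one extra.  An OpenMP
 * implementation is free to use a different scheme. */
void block_partition(int m, int thread_count, int my_rank,
                     int* first_iter, int* last_iter /* inclusive */) {
   int quotient  = m / thread_count;
   int remainder = m % thread_count;
   if (my_rank < remainder) {
      *first_iter = my_rank*(quotient + 1);
      *last_iter  = *first_iter + quotient;
   } else {
      *first_iter = my_rank*quotient + remainder;
      *last_iter  = *first_iter + quotient - 1;
   }
}
```

For example, with m = 10 iterations and thread_count = 4, this scheme assigns iterations 0-2 to thread 0, 3-5 to thread 1, 6-7 to thread 2, and 8-9 to thread 3.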
Note that it was essential that we made approx a reduction variable. If we hadn't, it would have been an ordinary shared variable, and the body of the loop

approx += f(a + i*h);

would be an unprotected critical section.
Also note that the default scope for all variables in a parallel directive is shared, but in our parallel for, if the loop variable i were shared, the variable update i++ would also be an unprotected critical section. Hence, in a loop that is parallelized with a parallel for directive, the default scope of the loop variable is private; in our code, each thread in the team has its own copy of i.
1. Caveats
This is truly wonderful: it may be possible to parallelize a serial program that consists of one large for loop by just adding a single parallel for directive. It may be possible to incrementally parallelize a serial program that has many for loops by successively placing parallel for directives before each loop.

However, things may not be quite as rosy as they seem. There are several caveats associated with the use of the parallel for directive. First, OpenMP will only parallelize for loops. It won't parallelize while loops or do-while loops. This may not seem to be too much of a limitation, since any code that uses a while loop or a do-while loop can be converted to equivalent code that uses a for loop instead.
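For instance, a counting while loop can be recast as a for loop. A small sketch (the function names are ours), summing the first n odd numbers both ways:

```c
#include <assert.h>

/* Sum the first n odd numbers with a while loop ... */
long sum_odds_while(int n) {
   long sum = 0;
   int i = 0;
   while (i < n) {
      sum += 2*i + 1;
      i++;
   }
   return sum;
}

/* ... and with the equivalent for loop.  Only the for version is a
 * candidate for a parallel for directive (with a reduction clause
 * for sum), since its iteration count is evident from the for
 * statement itself. */
long sum_odds_for(int n) {
   long sum = 0;
   for (int i = 0; i < n; i++)
      sum += 2*i + 1;
   return sum;
}
```

Note that the conversion only helps if the resulting for loop's iteration count can be determined before the loop runs, which is the subject of the next restriction.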
However, OpenMP will only parallelize for loops for which the number of iterations can be determined

- from the for statement itself (that is, the code for (. . . ; . . . ; . . .)), and
- prior to execution of the loop.
For example, the “infinite loop”
for ( ; ; ) {
. . .
}
cannot be parallelized. Similarly, the loop
for (i = 0; i < n; i++) {
if ( . . . )
break;
. . .
}
cannot be parallelized, since the number of iterations can't be determined from the for statement alone. This for loop is also not a structured block, since the break adds another point of exit from the loop.
In fact, OpenMP will only parallelize for loops that are in canonical form. Loops in canonical form take one of the forms shown in Program 5.3. The variables and expressions in this template are subject to some fairly obvious restrictions:

- The variable index must have integer or pointer type (e.g., it can't be a float).
- The expressions start, end, and incr must have a compatible type. For example, if index is a pointer, then incr must have integer type.
- The expressions start, end, and incr must not change during execution of the loop.
- During execution of the loop, the variable index can only be modified by the "increment expression" in the for statement.

These restrictions allow the run-time system to determine the number of iterations prior to execution of the loop.
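To see why these restrictions suffice, here is a sketch of the kind of trip-count computation the run-time system can do up front, for the canonical form for (index = start; index < end; index += incr) with incr > 0 (the other canonical forms are analogous):

```c
#include <assert.h>

/* Number of iterations of
 *    for (index = start; index < end; index += incr)
 * assuming incr > 0.  Because start, end, and incr can't change
 * during the loop, this value is known before the loop executes. */
int trip_count(int start, int end, int incr) {
   if (end <= start) return 0;
   return (end - start + incr - 1) / incr;   /* ceiling division */
}
```

For example, for (i = 1; i < n; i++) with n = 10 has trip_count(1, 10, 1) = 9 iterations, and for (i = 0; i < 10; i += 3) has 4 (i = 0, 3, 6, 9).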
The sole exception to the rule that the run-time system must be able to determine the number of iterations prior to execution is that there can be a call to exit in the body of the loop.
2. Data dependences
If a for loop fails to satisfy one of the rules outlined in the preceding section, the compiler will simply reject it. For example, suppose we try to compile a program with the following linear search function:
int Linear_search(int key, int A[], int n) {
   int i;
   /* thread_count is global */
#  pragma omp parallel for num_threads(thread_count)
   for (i = 0; i < n; i++)
      if (A[i] == key) return i;
   return -1;   /* key not in list */
}
The gcc compiler reports:

Line 6: error: invalid exit from OpenMP structured block
A more insidious problem occurs in loops in which the computation in one iteration depends on the results of one or more previous iterations. As an example, consider the following code, which computes the first n Fibonacci numbers:
fibo[0] = fibo[1] = 1;
for (i = 2; i < n; i++)
fibo[i] = fibo[i-1] + fibo[i-2];
Although we may be suspicious that something isn't quite right, let's try parallelizing the for loop with a parallel for directive:
fibo[0] = fibo[1] = 1;
#  pragma omp parallel for num_threads(thread_count)
   for (i = 2; i < n; i++)
      fibo[i] = fibo[i-1] + fibo[i-2];
The compiler will create an executable without complaint. However, if we try running it with more than one thread, we may find that the results are, at best, unpredictable. For example, on one of our systems, if we try using two threads to compute the first 10 Fibonacci numbers, we sometimes get

1 1 2 3 5 8 13 21 34 55,

which is correct. However, we also occasionally get

1 1 2 3 5 8 0 0 0 0.

What happened? It appears that the run-time system assigned the computation of fibo[2], fibo[3], fibo[4], and fibo[5] to one thread, while fibo[6], fibo[7], fibo[8], and fibo[9] were assigned to the other. (Remember the loop starts with i = 2.) In some runs of the program, everything is fine, because the thread that was assigned fibo[2], fibo[3], fibo[4], and fibo[5] finishes its computations before the other thread starts. However, in other runs, the first thread has evidently not computed fibo[4] and fibo[5] when the second computes fibo[6]. It appears that the system has initialized the entries in fibo to 0, and the second thread is using the values fibo[4] = 0 and fibo[5] = 0 to compute fibo[6]. It then goes on to use fibo[5] = 0 and fibo[6] = 0 to compute fibo[7], and so on.
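We can reproduce the bad run deterministically, without any threads at all, by simulating the unlucky schedule: run the second thread's block of iterations before the first thread's. This is a simulation we wrote for illustration, not OpenMP code:

```c
#include <assert.h>

#define N 10

/* Simulate the unlucky schedule: the array starts zeroed, and
 * "thread 1"'s block (i = 6..9) runs before "thread 0"'s block
 * (i = 2..5). */
void fib_bad_schedule(int fibo[N]) {
   for (int i = 0; i < N; i++) fibo[i] = 0;
   fibo[0] = fibo[1] = 1;
   for (int i = 6; i < N; i++)          /* "thread 1" goes first,   */
      fibo[i] = fibo[i-1] + fibo[i-2];  /* reading fibo[4] = fibo[5] = 0 */
   for (int i = 2; i < 6; i++)          /* "thread 0" goes second   */
      fibo[i] = fibo[i-1] + fibo[i-2];
}
```

The result is exactly the erroneous output shown above: 1 1 2 3 5 8 0 0 0 0.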
We see two important points here:

- OpenMP compilers don't check for dependences among iterations in a loop that's being parallelized with a parallel for directive. It's up to us, the programmers, to identify these dependences.
- A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.

The dependence of the computation of fibo[6] on the computation of fibo[5] is called a data dependence. Since the value of fibo[5] is calculated in one iteration and the result is used in a subsequent iteration, the dependence is sometimes called a loop-carried dependence.
3. Finding loop-carried dependences
Perhaps the first thing to observe is that when we're attempting to use a parallel for directive, we only need to worry about loop-carried dependences. We don't need to worry about more general data dependences. For example, in the loop

for (i = 0; i < n; i++) {
   x[i] = a + i*h;
   y[i] = exp(x[i]);
}
there is a data dependence between the statement that assigns x[i] and the statement that uses it to compute y[i]. However, there is no problem with the parallelization

#  pragma omp parallel for num_threads(thread_count)
   for (i = 0; i < n; i++) {
      x[i] = a + i*h;
      y[i] = exp(x[i]);
   }

since the computation of x[i] and its subsequent use will always be assigned to the same thread.
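One way to convince ourselves of this: each iteration reads and writes only its own x[i] and y[i], so running the iterations in any order (and hence under any assignment of iterations to threads) produces the same arrays. A quick serial check of that claim, with helper names of our own and a square in place of exp just to keep the sketch free of the math library:

```c
#include <assert.h>

#define N 8

/* Fill x and y iterating forward ... */
void fill_forward(double x[N], double y[N], double a, double h) {
   for (int i = 0; i < N; i++) {
      x[i] = a + i*h;
      y[i] = x[i]*x[i];   /* stand-in for exp(x[i]) */
   }
}

/* ... and backward.  Iteration i touches only x[i] and y[i], so the
 * order of the iterations doesn't affect the result. */
void fill_backward(double x[N], double y[N], double a, double h) {
   for (int i = N-1; i >= 0; i--) {
      x[i] = a + i*h;
      y[i] = x[i]*x[i];   /* stand-in for exp(x[i]) */
   }
}
```

Contrast this with the Fibonacci loop, where reversing the iteration order changes the answer, which is exactly the signature of a loop-carried dependence.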
Also observe that at least one of the statements must write or update the variable in order for the statements to represent a dependence, so in order to detect a loop-carried dependence, we should only concern ourselves with variables that are updated by the loop body. That is, we should look for variables that are read or written in one iteration and written in another. Let's look at a couple of examples.
4. Estimating π

One way to get a numerical approximation to π is to use many terms in the formula

π = 4 [ 1 − 1/3 + 1/5 − 1/7 + · · · ] = 4 Σ_{k=0}^{∞} (−1)^k / (2k+1).
We can implement this formula in serial code with

double factor = 1.0;
double sum = 0.0;

for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;
(Why is it important that factor is a double instead of an int or a long?)
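As a self-contained serial version (and a hint at the answer to the question above, noted in the comment):

```c
#include <assert.h>
#include <math.h>

/* Serial evaluation of pi = 4*(1 - 1/3 + 1/5 - 1/7 + ...).
 * factor must be a double: if it were an int or a long, the quotient
 * factor/(2*k+1) would use integer division, and every term after
 * the first would be 0. */
double pi_series(int n) {
   double factor = 1.0;
   double sum = 0.0;
   for (int k = 0; k < n; k++) {
      sum += factor/(2*k + 1);
      factor = -factor;
   }
   return 4.0*sum;
}
```

For an alternating series like this one, the error after n terms is bounded by the first omitted term, 4/(2n+1), so with n = 1000 the estimate agrees with π to about three decimal places, consistent with the single-thread output shown below.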
How can we parallelize this with OpenMP? We might at first be inclined to do something like this:

1   double factor = 1.0;
2   double sum = 0.0;
3   #  pragma omp parallel for num_threads(thread_count) \
4         reduction(+:sum)
5      for (k = 0; k < n; k++) {
6         sum += factor/(2*k+1);
7         factor = -factor;
8      }
9   pi_approx = 4.0*sum;
However, it's pretty clear that the update to factor in Line 7 in iteration k and the subsequent use of factor in the increment of sum in Line 6 in iteration k+1 is an instance of a loop-carried dependence. If iteration k is assigned to one thread and iteration k+1 is assigned to another thread, there's no guarantee that the value of factor in Line 6 will be correct. In this case we can fix the problem by examining the series: in iteration k the value of factor should be (−1)^k, which is +1 if k is even and −1 if k is odd. So if we replace the code in Lines 6 and 7 with

if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

we will eliminate the loop-carried dependence.
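We can check that the parity test really does reproduce the values the recurrence computes, using a small verification function of our own:

```c
#include <assert.h>

/* Returns 1 if the recurrence factor = -factor (starting from 1.0)
 * and the parity test (k % 2 == 0 ? 1.0 : -1.0) produce the same
 * value of factor for every k = 0, ..., n-1.  Unlike the recurrence,
 * the parity test depends only on k, so it carries no dependence
 * between iterations. */
int factors_agree(int n) {
   double factor = 1.0;
   for (int k = 0; k < n; k++) {
      double parity_factor = (k % 2 == 0) ? 1.0 : -1.0;
      if (factor != parity_factor) return 0;
      factor = -factor;   /* the loop-carried version */
   }
   return 1;
}
```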
However, things still aren't quite right. If we run the program on one of our systems with just two threads and n = 1000, the result is consistently wrong. For example:

With n = 1000 terms and 2 threads,
Our estimate of pi = 2.97063289263385
With n = 1000 terms and 2 threads,
Our estimate of pi = 3.22392164798593

On the other hand, if we run the program with only one thread, we always get

With n = 1000 terms and 1 threads,
Our estimate of pi = 3.14059265383979

What's wrong here?
Recall that in a block that has been parallelized by a parallel for directive, by default any variable declared before the loop (with the sole exception of the loop variable) is shared among the threads. So factor is shared, and, for example, thread 0 might assign it the value 1, but before it can use this value in the update to sum, thread 1 could assign it the value −1. Therefore, in addition to eliminating the loop-carried dependence in the calculation of factor, we need to ensure that each thread has its own copy of factor. That is, in order to make our code correct, we need to also ensure that factor has private scope. We can do this by adding a private clause to the parallel for directive:
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+:sum) private(factor)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
The private clause specifies that for each variable listed inside the parentheses, a private copy is to be created for each thread. Thus, in our example, each of the thread_count threads will have its own copy of the variable factor, and hence the updates of one thread to factor won't affect the value of factor in another thread.
It's important to remember that the value of a variable with private scope is unspecified at the beginning of a parallel block or a parallel for block. Its value is also unspecified after completion of a parallel or parallel for block. So, for example, the output of the first printf statement in the following code is unspecified, since it prints the private variable x before it's explicitly initialized. Similarly, the output of the final printf is unspecified, since it prints x after the completion of the parallel block.
int x = 5;
#  pragma omp parallel num_threads(thread_count) \
      private(x)
   {
      int my_rank = omp_get_thread_num();
      printf("Thread %d > before initialization, x = %d\n",
            my_rank, x);
      x = 2*my_rank + 2;
      printf("Thread %d > after initialization, x = %d\n",
            my_rank, x);
   }
printf("After parallel block, x = %d\n", x);
5. More on scope
Our problem with the variable factor is a common one. We usually need to think about the scope of each variable in a parallel block or a parallel for block. Therefore, rather than letting OpenMP decide on the scope of each variable, it's a very good practice for us as programmers to specify the scope of each variable in a block. In fact, OpenMP provides a clause that explicitly requires us to do this: the default clause. If we add the clause

default(none)

to our parallel or parallel for directive, then the compiler will require that we specify the scope of each variable we use in the block that has been declared outside the block. (Variables that are declared within the block are always private, since they are allocated on the thread's stack.)
For example, using a default(none) clause, our calculation of π could be written as follows:
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+:sum) private(k, factor) \
      shared(n)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
In this example, we use four variables in the for loop. With the default clause, we need to specify the scope of each. As we've already noted, sum is a reduction variable (which has properties of both private and shared scope). We've also already noted that factor and the loop variable k should have private scope. Variables that are never updated in the parallel or parallel for block, such as n in this example, can be safely shared. Recall that unlike private variables, shared variables have the same value in the parallel or parallel for block that they had before the block, and their value after the block is the same as their last value in the block. Thus, if n were initialized before the block to 1000, it would retain this value in the parallel for statement, and since its value isn't changed in the for loop, it would retain this value after the loop has completed.