Chapter: An Introduction to Parallel Programming : Shared-Memory Programming with OpenMP

The Reduction Clause


If we developed a serial implementation of the trapezoidal rule, we’d probably use a slightly different function prototype. Rather than


 

void Trap(double a, double b, int n, double* global_result_p);

we would probably define

 

double Trap(double a, double b, int n);

 

and our function call would be

 

global_result = Trap(a, b, n);

This is somewhat easier to understand and probably more attractive to all but the most fanatical believers in pointers.

 

We resorted to the pointer version because we needed to add each thread’s local calculation to get global_result. However, we might prefer the following function prototype:

double Local_trap(double a, double b, int n);

With this prototype, the body of Local_trap would be the same as the Trap function in Program 5.2, except that there would be no critical section. Rather, each thread would return its part of the calculation, the final value of its my_result variable. If we made this change, we might try modifying our parallel block so that it looks like this:

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
#     pragma omp critical
      global_result += Local_trap(a, b, n);
   }

 

Can you see a problem with this code? It should give the correct result. However, since we’ve specified that the critical section is

global_result += Local_trap(a, b, n);

the call to Local_trap can only be executed by one thread at a time, so, effectively, we’re forcing the threads to execute the trapezoidal rule sequentially. If we check the run-time of this version, it may actually be slower with multiple threads than with one thread (see Exercise 5.3).

We can avoid this problem by declaring a private variable inside the parallel block and moving the critical section after the function call:

 

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* private */
      my_result += Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }

 

Now the call to Local_trap is outside the critical section, and the threads can execute their calls simultaneously. Furthermore, since my_result is declared in the parallel block, it’s private, and before the critical section each thread will store its part of the calculation in its my_result variable.

OpenMP provides a cleaner alternative that also avoids serializing execution of Local_trap: we can specify that global_result is a reduction variable. A reduction operator is a binary operation (such as addition or multiplication) and a reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result. Furthermore, all of the intermediate results of the operation should be stored in the same variable: the reduction variable. For example, if A is an array of n ints, the computation

int sum = 0;

 

for (i = 0; i < n; i++) sum += A[i];

 

is a reduction in which the reduction operator is addition.

 

In OpenMP it may be possible to specify that the result of a reduction is a reduction variable. To do this, a reduction clause can be added to a parallel directive. In our example, we can modify the code as follows:

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);

First note that the parallel directive is two lines long. Recall that C preprocessor directives are, by default, only one line long, so we need to “escape” the newline character by putting a backslash (\) immediately before it.

The code specifies that global_result is a reduction variable and the plus sign (“+”) indicates that the reduction operator is addition. Effectively, OpenMP creates a private variable for each thread, and the run-time system stores each thread’s result in this private variable. OpenMP also creates a critical section and the values stored in the private variables are added in this critical section. Thus, the calls to Local_trap can take place in parallel.

The syntax of the reduction clause is

 

reduction(<operator>: <variable list>)

 

In C, operator can be any one of the operators +, *, -, &, |, ^, &&, ||, although the use of subtraction is a bit problematic, since subtraction isn’t associative or commutative. For example, the serial code

 

result = 0;

 

for (i = 1; i <= 4; i++)

result -= i;

 

stores the value −10 in result. If, however, we split the iterations among two threads, with thread 0 subtracting 1 and 2 and thread 1 subtracting 3 and 4, then thread 0 will compute −3 and thread 1 will compute −7 and, of course, −3 − (−7) = 4.

In principle, the compiler should determine that the threads’ individual results should actually be added (−3 + (−7) = −10), and, in practice, this seems to be the case. However, the OpenMP Standard [42] doesn’t seem to guarantee this.

 

It should also be noted that if a reduction variable is a float or a double, the results may differ slightly when different numbers of threads are used. This is due to the fact that floating point arithmetic isn’t associative. For example, if a, b, and c are floats, then (a + b) + c may not be exactly equal to a + (b + c). See Exercise 5.5.

 

When a variable is included in a reduction clause, the variable itself is shared. However, a private variable is created for each thread in the team. In the parallel block each time a thread executes a statement involving the variable, it uses the private variable. When the parallel block ends, the values in the private variables are combined into the shared variable. Thus, our latest version of the code

 

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result += Local_trap(a, b, n);

 

effectively executes code that is identical to our previous version:

 

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;   /* private */
      my_result += Local_trap(a, b, n);
#     pragma omp critical
      global_result += my_result;
   }

One final point to note is that the threads’ private variables are initialized to 0. This is analogous to our initializing my_result to zero. In general, the private variables created for a reduction clause are initialized to the identity value for the operator. For example, if the operator is multiplication, the private variables would be initialized to 1.
