Chapter: Multicore Application Programming For Windows, Linux, and Oracle Solaris : Using Automatic Parallelization and OpenMP

Identifying and Parallelizing Reductions

When a loop reduces a large amount of data down to a smaller set of values, the operation is called a reduction.

Identifying and Parallelizing Reductions

When a loop reduces a large amount of data down to a smaller set of values, the operation is called a reduction. The classic example of a reduction is computing the sum of an array of numbers, as shown in Listing 7.11.

Listing 7.11 Calculating the Sum of an Array of Numbers

double sum( double* array, int length )

{

double total=0; int i;

for ( i=0; i<length; i++ )

{

total += array[i];

}

return total;

}

To create a reduction in parallel, the reduction operator needs to be commutative— performing the operations in a different order must not cause an incorrect result. The possible reduction operations are addition, subtraction, multiplication, and the logical operations, such as AND or OR, as well as the operations MIN and MAX when applied to an array of numbers.

However, some operations on floating-point numbers cannot be reordered without causing some potential numeric differences to the output. The addition of floating-point values is a good example of this. There are situations where adding A and B and then C will give a different numeric value than adding C and A and then B. The order of the operations is important. This is not a problem unique to parallel codes; serial codes have the same ordering constraints. However, for parallel codes, this constraint may stop a compiler from producing a parallel version of a code construct.

To consider a contrived example, assume you have an array of floating point numbers sorted from the largest element to the smallest element. When you sum the elements in this array, the value may reach a point where the sum has become so large that adding the small elements onto this sum causes no impact, because the increase is less than can be registered as an increase in the variable holding the total.

Suppose you take this same array and use two threads to compute the result. Each thread computes a partial total over half of the range of numbers. The first thread calcu-lates the sum of the first half of the array, which contains the list of the large numbers. The second thread calculates the sum of the second half of the array, which contains the list of the small numbers. Now the second thread will compute a much smaller value for its partial total. When two small positive values are added together, it is more likely that the result is greater than the largest of the two values. In contrast, if a small value is added to a much larger value, it is likely that the result will be identical to the larger value. The consequence of this is that the small values will accumulate and be recorded in the summation computed by the second thread.

At the end of the parallel region, the values from both threads are added to produce the final result. The value that the second thread computes is potentially large enough to cause a small change in value when added to the large result from the first thread. The result in the computation by using two threads is likely to be different from the result computed using only one thread. The difference might only be in the smallest significant figure, or it might even be a rounding difference. But, the result could potentially be different.

Some compilers place the decision of whether to perform reductions under the con-trol of the user. The Solaris Studio compiler requires that the user specify the flag -xreduction to parallelize reduction operations. The Intel compiler does not require an additional flag to recognize reductions. We can see the results of using this flag on the code from Listing 7.5 in the output shown in Listing 7.12.

Listing 7.12 Parallelization of Reduction Operations

$ cc -g -xautopar -xloopinfo -xreduction -O -c fploop.c

"fploop.c", line 5: PARALLELIZED, and serial version generated

"fploop.c", line 8: not parallelized, not profitable

The compiler output shows that it recognized the inner loop at line 8 and did not declare it an unsafe dependency. Instead, the loop is reported as not being profitable to parallelize. This is the expected behavior; as we have previously discussed, it is much more effective to parallelize the outer loop and leave the inner loop as serial code.

Reductions are present in many codes, and it is usually appropriate to parallelize them as long as the developer is aware that this may cause a difference in the generated results.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Multicore Application Programming For Windows, Linux, and Oracle Solaris : Using Automatic Parallelization and OpenMP : Identifying and Parallelizing Reductions |