Identifying and
Parallelizing Reductions
When a loop reduces a large amount of data down to a smaller set of
values, the operation is called a reduction.
The classic example of a reduction is computing the sum of an array of numbers,
as shown in Listing 7.11.
Listing 7.11 Calculating
the Sum of an Array of Numbers
double sum( double* array, int length )
{
double total=0; int i;
for ( i=0; i<length; i++ )
{
total += array[i];
}
return total;
}
To create a reduction in parallel, the reduction
operator needs to be commutative— performing the operations in a different
order must not cause an incorrect result. The possible reduction operations are
addition, subtraction, multiplication, and the logical operations, such as AND or OR, as well as the operations MIN and MAX when applied to an array of
numbers.
However, some operations on floating-point numbers
cannot be reordered without causing some potential numeric differences to the
output. The addition of floating-point values is a good example of this. There
are situations where adding A and B and then C will give a different numeric
value than adding C and A and then B. The order of the operations is important.
This is not a problem unique to parallel codes; serial codes have the same
ordering constraints. However, for parallel codes, this constraint may stop a
compiler from producing a parallel version of a code construct.
To consider a contrived example, assume you have an array of floating
point numbers sorted from the largest element to the smallest element. When you
sum the elements in this array, the value may reach a point where the sum has
become so large that adding the small elements onto this sum causes no impact,
because the increase is less than can be registered as an increase in the
variable holding the total.
Suppose you take this same array and use two
threads to compute the result. Each thread computes a partial total over half
of the range of numbers. The first thread calcu-lates the sum of the first half
of the array, which contains the list of the large numbers. The second thread
calculates the sum of the second half of the array, which contains the list of
the small numbers. Now the second thread will compute a much smaller value for
its partial total. When two small positive values are added together, it is
more likely that the result is greater than the largest of the two values. In
contrast, if a small value is added to a much larger value, it is likely that
the result will be identical to the larger value. The consequence of this is
that the small values will accumulate and be recorded in the summation computed
by the second thread.
At the end of the parallel region, the values from
both threads are added to produce the final result. The value that the second
thread computes is potentially large enough to cause a small change in value
when added to the large result from the first thread. The result in the
computation by using two threads is likely to be different from the result
computed using only one thread. The difference might only be in the smallest
significant figure, or it might even be a rounding difference. But, the result
could potentially be different.
Some compilers place the decision of whether to
perform reductions under the con-trol of the user. The Solaris Studio compiler
requires that the user specify the flag -xreduction to parallelize reduction operations. The Intel compiler does not require
an additional flag to recognize reductions. We can
see the results of using this flag on the code from Listing 7.5 in the output
shown in Listing 7.12.
Listing 7.12 Parallelization
of Reduction Operations
$
cc -g -xautopar -xloopinfo -xreduction
-O -c fploop.c
"fploop.c",
line 5: PARALLELIZED, and serial version generated
"fploop.c",
line 8: not parallelized, not profitable
The compiler output shows that it recognized the
inner loop at line 8 and did not declare it an unsafe dependency. Instead, the
loop is reported as not being profitable to parallelize. This is the expected
behavior; as we have previously discussed, it is much more effective to
parallelize the outer loop and leave the inner loop as serial code.
Reductions are present in many codes, and it is
usually appropriate to parallelize them as long as the developer is aware that
this may cause a difference in the generated results.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.