Using Automatic Parallelization to Produce a Parallel Application
Most compilers are able to perform some degree of automatic
parallelization. In an ideal world, automatic parallelization would be just
another compiler optimization, but currently there are significant limitations
on what can be achieved. This is undoubtedly an area that will improve in time.
However, in many instances, it is possible to assist the compiler in making the
code parallel.
In this section, we will explore the ability of both the Oracle Solaris Studio and Intel compilers to perform automatic parallelization. As well as the ability to perform automatic parallelization, it is also important for the compilers to be able to provide feedback on which parts of the code were parallelized and what inhibited parallelization of other regions of code.
Current compilers can only automatically parallelize loops. Loops are a very good target for parallelization because they are iterated many times, so the block of code they contain accumulates significant time. As previously discussed, any parallel region must perform significant work to overcome the costs that the parallelization incurs.
Listing 7.1 shows a simple example of a loop that might be automatically
parallelized.
Listing 7.1 Code to Set Up a Vector of Double-Precision Values
#include <stdlib.h>

void setup( double *vector, int length )
{
  int i;
  for ( i=0; i<length; i++ )          // Line 6
  {
    vector[i] += 1.0;
  }
}

int main()
{
  double *vector;
  vector = (double*)malloc( sizeof(double)*1024*1024 );
  for ( int i=0; i<1000; i++ )        // Line 16
  {
    setup( vector, 1024*1024 );
  }
}
The Solaris Studio C compiler uses the flag -xautopar to enable automatic parallelization and the flag -xloopinfo to report information on the degree of parallelization obtained.
Listing 7.2 shows the results of compiling this code snippet.
Listing 7.2 Compiling Code with Autopar
$ cc -g -xautopar -xloopinfo -O -c omp_vector.c
"omp_vector.c", line 6: PARALLELIZED, and serial version generated
"omp_vector.c", line 16: not parallelized, call may be unsafe
There are two loops in the code, and although the compiler has managed
to parallelize the first loop, it has not been able to parallelize the second
loop. The compiler reports that the function call in the second loop stopped
the parallelization of the loop. We will discuss avoiding this problem later in
the section.
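To see why the call is the obstacle, consider manually inlining the body of setup() into the calling loop. This is only an illustrative sketch, not the technique discussed later in the section: with the call gone, the compiler can see all of the work performed inside the loop nest and analyze it for parallelization.

#include <stdlib.h>

int main()
{
  double *vector = (double*)malloc( sizeof(double)*1024*1024 );
  for ( int i=0; i<1000; i++ )
  {
    /* Body of setup() inlined by hand; the loop no longer contains a call,
       so the compiler can inspect everything the iteration does. */
    for ( int j=0; j<1024*1024; j++ )
    {
      vector[j] += 1.0;
    }
  }
}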
The Intel compiler uses the option -parallel to enable parallelization and the option -par-report to report its success. The compiler also has the option
-par-threshold{n}, which controls the threshold at which the compiler will parallelize a
loop. The option -par-threshold0 will make
the compiler parallelize all candidate loops; the default of
-par-threshold100 indicates that the compiler
should parallelize only those loops that are certain to benefit. Listing 7.3 shows the output from the Intel
compiler on the same source file. The flag -fno-inline-functions disables function inlining in the compiler and ensures that the
generated code is the same for the two compilers.
Listing 7.3 Automatic Parallelization Using the Intel Compiler
$ icc -std=c99 -O -parallel -par-report1 -par-threshold0 \
  -fno-inline-functions omp_vector.c
omp_vector.c(6): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.
The number of parallel threads used in the loop is
controlled by the environment variable OMP_NUM_THREADS. Listing 7.4 shows the performance of the code when run with one and
two threads. It is useful to examine the time reported for the serial and
parallel codes. The user time is the same in both instances, which indicates
that the two codes did the same amount of work. However, the real, or wall,
time is less for the parallel version. This is to be expected. Spreading a
constant amount of work over two threads would ideally lead to each thread
completing half the work.
Listing 7.4 Performance
of the Parallel Code with One and Two Threads
$ export OMP_NUM_THREADS=1
$ timex a.out

real  3.55
user  3.55
sys   0.02

$ export OMP_NUM_THREADS=2
$ timex a.out

real  2.10
user  3.55
sys   0.04
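From these figures, the speedup from using two threads is roughly 3.55 / 2.10 ≈ 1.7x rather than the ideal 2x; the shortfall is accounted for by the overheads of the parallelization, such as starting and synchronizing the threads.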
As a more complex example of automatic parallelization, consider the
loop in Listing 7.5, which multiplies a matrix by a vector and places the
result in a second vector.
Listing 7.5 Code
to Multiply a Matrix by a Vector
void matVec( double **mat, double *vec, double *out, int *row, int *col )
{
  int i, j;

  for ( i=0; i<*row; i++ )            // Line 5
  {
    out[i] = 0;
    for ( j=0; j<*col; j++ )          // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
Listing 7.6 shows the results of compiling this code with the Solaris
Studio compiler.
Listing 7.6 Compiling
Code with Autopar
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: not parallelized, not a recognized for loop
"fploop.c", line 8: not parallelized, not a recognized for loop
The compiler does not recognize either of the for loops as loops that can be parallelized. The reason for this is the possibility of aliasing between the store to out[i] and the values used to determine the loop bounds, *row and *col. A requirement for the compiler to automatically parallelize the loop is that the loop bounds must remain constant. A store to either of the loop boundaries would violate that restriction. Therefore, it is not a form of loop that can be automatically parallelized. As a programmer, it would be unusual to write code that relies on stores to elements in the array changing the loop boundaries, but for the compiler, the only safe assumption is that these might alias.
The most general-purpose way of correcting this is to place the loop limits into local temporary variables. This removes the possibility that the loop limit might alias with one of the stores in the loop; a sketch of this approach is shown below. For the code shown in Listing 7.5, it is easy to perform the equivalent change and instead pass the loop bounds by value rather than passing them as pointers to the values. Listing 7.7 shows the modified loop.
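As a rough sketch of the local-temporary approach (illustrative only; the compiler output that follows refers to the by-value version in Listing 7.7):

void matVec( double **mat, double *vec, double *out, int *row, int *col )
{
  int i, j;
  int rows = *row;   /* Local copies of the bounds: no store performed */
  int cols = *col;   /* inside the loops can change these values.      */

  for ( i=0; i<rows; i++ )
  {
    out[i] = 0;
    for ( j=0; j<cols; j++ )
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}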
Listing 7.7 Code
Modified to Avoid Aliasing with Loop Counter
void matVec( double **mat, double *vec, double *out, int row, int col )
{
  int i, j;

  for ( i=0; i<row; i++ )             // Line 5
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )           // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
Listing 7.8 shows the output from the compiler when
this new variant of the code is compiled.
Listing 7.8 Compiling
Modified Code with Automatic Parallelization
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: not parallelized, unsafe dependence
"fploop.c", line 8: not parallelized, unsafe dependence
The code modification has enabled the compiler to
recognize the loops as candidates for parallelization, but the compiler has hit
a problem because the elements pointed to by out might alias with the elements pointed to either by the matrix, mat, or by the vector, vec. One way to resolve this is to
use a restrict-qualified pointer to hold the location of the output array.
Listing 7.9 shows the modified code for this.
Listing 7.9 Using
Restrict-Qualified Pointer for Address of Output Array
void matVec( double **mat, double *vec, double * restrict out, int row, int col )
{
  int i, j;

  for ( i=0; i<row; i++ )             // Line 5
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )           // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
After this adjustment to the source code, the
compiler is able to produce a parallel version of the loop, as shown in Listing
7.10.
Listing 7.10 Compiling
Code Containing Restrict-Qualified Pointer
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: PARALLELIZED, and serial version generated
"fploop.c", line 8: not parallelized, unsafe dependence
The Solaris Studio compiler generates two versions of the loop, a parallel version and a serial version. At runtime, the generated code determines whether the trip count of the loop is high enough for the parallel version to run faster than the serial version, and it executes whichever version is expected to be quicker.
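Conceptually, the dispatch behaves something like the following sketch; the threshold and the function names are purely illustrative and are not what the compiler actually emits:

/* Hypothetical runtime selection between the two generated versions. */
if ( row < PARALLEL_TRIP_THRESHOLD )          /* too few iterations to  */
  matVec_serial( mat, vec, out, row, col );   /* repay thread overheads */
else
  matVec_parallel( mat, vec, out, row, col );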
The compiler reports that the loop at line 8 in Listing 7.9 has an
unsafe dependence; the reason for this decision will be discussed in the next
section, “Identifying and Parallelizing Reductions.”
Of the two loops in the code, the compiler
parallelizes the outer loop but not the inner loop. This is the best decision
to make for performance. The threads performing the work in parallel need to
synchronize once the parallel work has completed. If the outer loop is parallelized,
then the threads need to synchronize only once the outer loop has completed. If
the inner loop were to be made parallel, then the threads would have to
synchronize every time an iteration of the outer loop completed. The number of
synchronization events would equal the number of times that the outer loop was
iterated. Hence, it is much more efficient to make the outer loop parallel.
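To make the difference concrete, the compiler's choice corresponds roughly to placing the parallel construct on the outer loop rather than the inner one, as in this hand-written OpenMP sketch (an illustration of the trade-off, not the code the compiler generates):

void matVec( double **mat, double *vec, double * restrict out, int row, int col )
{
  int i, j;
  /* Parallelizing the outer loop means the threads synchronize once, after
     all the rows have been processed. Placing the pragma on the inner loop
     instead would force a synchronization at the end of every row. */
  #pragma omp parallel for private(j)
  for ( i=0; i<row; i++ )
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}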