Using Automatic Parallelization to Produce a Parallel Application
Most compilers are able to perform some degree of automatic
parallelization. In an ideal world, automatic parallelization would be just
another compiler optimization, but currently there are significant limitations
on what can be achieved. This is undoubtedly an area that will improve in time.
However, in many instances, it is possible to assist the compiler in making the
code parallel.
In this section, we will explore the ability of both the Oracle Solaris Studio and Intel compilers to perform automatic parallelization. As well as the ability to perform automatic parallelization, it is also important for the compilers to be able to provide feedback on which parts of the code were parallelized and what inhibited parallelization of other regions of code.
Current compilers can only automatically parallelize loops. Loops are a very good target for parallelization because they are iterated many times, so the block of code they contain accumulates significant time. As previously discussed, any parallel region must perform significant work to overcome the costs that the parallelization incurs.
Listing 7.1 shows a simple example of a loop that might be automatically
parallelized.
Listing 7.1 Code to Set Up a Vector of Double-Precision Values
#include <stdlib.h>

void setup( double *vector, int length )
{
  int i;
  for ( i=0; i<length; i++ )          // Line 6
  {
    vector[i] += 1.0;
  }
}

int main()
{
  double *vector;
  vector = (double*)malloc( sizeof(double)*1024*1024 );
  for ( int i=0; i<1000; i++ )        // Line 16
  {
    setup( vector, 1024*1024 );
  }
}
The Solaris Studio C compiler uses the flag -xautopar to enable automatic parallelization and the flag -xloopinfo to report information on the degree of parallelization obtained.
Listing 7.2 shows the results of compiling this code snippet.
Listing 7.2 Compiling Code with Autopar
$ cc -g -xautopar -xloopinfo -O -c omp_vector.c
"omp_vector.c", line 6: PARALLELIZED, and serial version generated
"omp_vector.c", line 16: not parallelized, call may be unsafe
There are two loops in the code, and although the compiler has managed
to parallelize the first loop, it has not been able to parallelize the second
loop. The compiler reports that the function call in the second loop stopped
the parallelization of the loop. We will discuss avoiding this problem later in
the section.
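To see why the call is the obstacle, consider manually inlining the body of setup() into the calling loop. This is only an illustrative sketch, not the technique discussed later in the section: with the call gone, the compiler can see all of the work performed inside the loop nest and analyze it for parallelization.

#include <stdlib.h>

int main()
{
  double *vector = (double*)malloc( sizeof(double)*1024*1024 );
  for ( int i=0; i<1000; i++ )
  {
    /* Body of setup() inlined by hand; the loop no longer contains a call,
       so the compiler can inspect everything the iteration does. */
    for ( int j=0; j<1024*1024; j++ )
    {
      vector[j] += 1.0;
    }
  }
}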
The Intel compiler uses the option -parallel to enable parallelization and the option -par-report to report its success. The compiler also has the option
-par-threshold{n}, which controls the threshold at which the compiler will parallelize a
loop. The option -par-threshold0 will make
the compiler parallelize all candidate loops; the default of
-par-threshold100 indicates that the compiler
should parallelize only those loops that are certain to benefit. Listing 7.3 shows the output from the Intel
compiler on the same source file. The flag -fno-inline-functions disables function inlining in the compiler and ensures that the
generated code is the same for the two compilers.
Listing 7.3 Automatic Parallelization Using the Intel Compiler
$ icc -std=c99 -O -parallel -par-report1 -par-threshold0 \
  -fno-inline-functions omp_vector.c
omp_vector.c(6): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.
The number of parallel threads used in the loop is
controlled by the environment variable OMP_NUM_THREADS. Listing 7.4 shows the performance of the code when run with one and
two threads. It is useful to examine the time reported for the serial and
parallel codes. The user time is the same in both instances, which indicates
that the two codes did the same amount of work. However, the real, or wall,
time is less for the parallel version. This is to be expected. Spreading a
constant amount of work over two threads would ideally lead to each thread
completing half the work.
Listing 7.4 Performance
of the Parallel Code with One and Two Threads
$ export OMP_NUM_THREADS=1
$ timex a.out

real  3.55
user  3.55
sys   0.02

$ export OMP_NUM_THREADS=2
$ timex a.out

real  2.10
user  3.55
sys   0.04
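From these figures, the speedup from using two threads is roughly 3.55 / 2.10 ≈ 1.7x rather than the ideal 2x; the shortfall is accounted for by the overheads of the parallelization, such as starting and synchronizing the threads.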
As a more complex example of automatic parallelization, consider the
loop in Listing 7.5, which multiplies a matrix by a vector and places the
result in a second vector.
Listing 7.5 Code
to Multiply a Matrix by a Vector
void matVec( double **mat, double *vec, double *out, int *row, int *col )
{
  int i, j;

  for ( i=0; i<*row; i++ )            // Line 5
  {
    out[i] = 0;
    for ( j=0; j<*col; j++ )          // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
Listing 7.6 shows the results of compiling this code with the Solaris
Studio compiler.
Listing 7.6 Compiling
Code with Autopar
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: not parallelized, not a recognized for loop
"fploop.c", line 8: not parallelized, not a recognized for loop
The compiler does not recognize either of the for loops as loops that can be parallelized. The reason for this is the possibility of aliasing between the store to out[i] and the values used to determine the loop bounds, *row and *col. A requirement for the compiler to automatically parallelize the loop is that the loop bounds must remain constant. A store to either of the loop boundaries would violate that restriction. Therefore, it is not a form of loop that can be automatically parallelized. As a programmer, it would be unusual to write code that relies on stores to elements in the array changing the loop boundaries, but for the compiler, the only safe assumption is that these might alias.
The most general-purpose way of correcting this is to place the loop limits into local temporary variables. This removes the possibility that the loop limit might alias with one of the stores in the loop; a sketch of this approach is shown below. For the code shown in Listing 7.5, it is easy to perform the equivalent change and instead pass the loop bounds by value rather than passing them as pointers to the values. Listing 7.7 shows the modified loop.
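As a rough sketch of the local-temporary approach (illustrative only; the compiler output that follows refers to the by-value version in Listing 7.7):

void matVec( double **mat, double *vec, double *out, int *row, int *col )
{
  int i, j;
  int rows = *row;   /* Local copies of the bounds: no store performed */
  int cols = *col;   /* inside the loops can change these values.      */

  for ( i=0; i<rows; i++ )
  {
    out[i] = 0;
    for ( j=0; j<cols; j++ )
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}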
Listing 7.7 Code
Modified to Avoid Aliasing with Loop Counter
void matVec( double **mat, double *vec, double *out, int row, int col )
{
  int i, j;

  for ( i=0; i<row; i++ )             // Line 5
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )           // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
Listing 7.8 shows the output from the compiler when
this new variant of the code is compiled.
Listing 7.8 Compiling
Modified Code with Automatic Parallelization
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: not parallelized, unsafe dependence
"fploop.c", line 8: not parallelized, unsafe dependence
The code modification has enabled the compiler to
recognize the loops as candidates for parallelization, but the compiler has hit
a problem because the elements pointed to by out might alias with the elements pointed to either by the matrix, mat, or by the vector, vec. One way to resolve this is to
use a restrict-qualified pointer to hold the location of the output array.
Listing 7.9 shows the modified code for this.
Listing 7.9 Using
Restrict-Qualified Pointer for Address of Output Array
void matVec( double **mat, double *vec, double * restrict out, int row, int col )
{
  int i, j;

  for ( i=0; i<row; i++ )             // Line 5
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )           // Line 8
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}
After this adjustment to the source code, the
compiler is able to produce a parallel version of the loop, as shown in Listing
7.10.
Listing 7.10 Compiling
Code Containing Restrict-Qualified Pointer
$ cc -g -xautopar -xloopinfo -O -c fploop.c
"fploop.c", line 5: PARALLELIZED, and serial version generated
"fploop.c", line 8: not parallelized, unsafe dependence
The Solaris Studio compiler generates two versions of the loop, a parallel version and a serial version. At runtime, the generated code determines whether the trip count of the loop is high enough for the parallel version to run faster than the serial version, and it executes whichever version is expected to be quicker.
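Conceptually, the dispatch behaves something like the following sketch; the threshold and the function names are purely illustrative and are not what the compiler actually emits:

/* Hypothetical runtime selection between the two generated versions. */
if ( row < PARALLEL_TRIP_THRESHOLD )          /* too few iterations to  */
  matVec_serial( mat, vec, out, row, col );   /* repay thread overheads */
else
  matVec_parallel( mat, vec, out, row, col );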
The compiler reports that the loop at line 8 in Listing 7.9 has an
unsafe dependence; the reason for this decision will be discussed in the next
section, “Identifying and Parallelizing Reductions.”
Of the two loops in the code, the compiler
parallelizes the outer loop but not the inner loop. This is the best decision
to make for performance. The threads performing the work in parallel need to
synchronize once the parallel work has completed. If the outer loop is parallelized,
then the threads need to synchronize only once the outer loop has completed. If
the inner loop were to be made parallel, then the threads would have to
synchronize every time an iteration of the outer loop completed. The number of
synchronization events would equal the number of times that the outer loop was
iterated. Hence, it is much more efficient to make the outer loop parallel.
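To make the difference concrete, the compiler's choice corresponds roughly to placing the parallel construct on the outer loop rather than the inner one, as in this hand-written OpenMP sketch (an illustration of the trade-off, not the code the compiler generates):

void matVec( double **mat, double *vec, double * restrict out, int row, int col )
{
  int i, j;
  /* Parallelizing the outer loop means the threads synchronize once, after
     all the rows have been processed. Placing the pragma on the inner loop
     instead would force a synchronization at the end of every row. */
  #pragma omp parallel for private(j)
  for ( i=0; i<row; i++ )
  {
    out[i] = 0;
    for ( j=0; j<col; j++ )
    {
      out[i] += mat[i][j] * vec[j];
    }
  }
}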