Collapsing Loops to Improve Workload Balance
The parallel for directive applies only to the loop that immediately follows it. It is usually best to parallelize the outermost loop, because this reduces the number of synchronizations necessary. However, a low trip count for the outer loop limits the maximum number of threads that can be used in parallel. In these cases, it might be appropriate to parallelize the inner loop instead, since it could have a higher iteration count. Without knowing the trip counts for the two loops, it is not possible to decide which strategy is more appropriate.
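The two placements of the directive can be sketched as follows. This is a minimal illustration rather than a listing from the text: the function names, the ROWS and COLS constants, and the integer work in the loop body are all assumptions made for the example.

```c
#define ROWS 2        /* assumed low outer trip count  */
#define COLS 10000    /* assumed high inner trip count */

/* Hypothetical helper: parallelize the outer loop.
   One parallel region and one implicit barrier, but at most
   ROWS threads can be given any work. */
void fill_outer_parallel( int out[ROWS][COLS] )
{
    #pragma omp parallel for
    for ( int i = 0; i < ROWS; i++ )
        for ( int j = 0; j < COLS; j++ )
            out[i][j] = i + j;
}

/* Hypothetical helper: parallelize the inner loop.
   Up to COLS threads can be given work, but the parallel region
   is entered, and its implicit barrier reached, ROWS times. */
void fill_inner_parallel( int out[ROWS][COLS] )
{
    for ( int i = 0; i < ROWS; i++ )
    {
        #pragma omp parallel for
        for ( int j = 0; j < COLS; j++ )
            out[i][j] = i + j;
    }
}
```

Both functions compute the same result; they differ only in how many threads can be kept busy and how often the threads must synchronize. Which one is faster depends on the actual trip counts and on the cost of the loop body.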
However, OpenMP provides a way of avoiding issues with the outermost loop having a low trip count, which is to collapse the inner and outer loops into a single loop. The clause to do this is collapse, which takes the number of loops to collapse as a parameter. Listing 7.61 shows an example of code where the outer loop has a low trip count, and using the collapse clause enables scaling to higher numbers of threads.
Listing 7.61 Using the collapse Clause to Improve Scaling
#pragma omp parallel for collapse( 2 )
for( int i=0; i<2; i++ )
    for( int j=0; j<10000; j++ )
        array[i][j] = sin( i+j );
Without the collapse clause, the outermost loop can scale to at most two threads. With the collapse clause, the combined loop has 20,000 iterations and so could theoretically use up to 20,000 threads (although synchronization overheads would cause the code to run slowly long before that count was reached). Using the collapse clause may introduce additional overhead into the parallel region, so it is worth measuring whether the clause improves performance or causes a performance loss.
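One way to picture both the combined iteration space and the source of the extra overhead is to flatten the loop nest by hand. The sketch below is an approximation for illustration only: an OpenMP implementation is free to generate different code, and the function names, the OUTER and INNER constants, and the integer loop body are assumptions rather than part of the original listing.

```c
#define OUTER 2        /* assumed outer trip count */
#define INNER 10000    /* assumed inner trip count */

/* Hypothetical hand-flattened version: a single loop over
   OUTER * INNER iterations, with the original indices
   recovered from the combined index k. */
void fill_flattened( int out[OUTER][INNER] )
{
    #pragma omp parallel for
    for ( int k = 0; k < OUTER * INNER; k++ )
    {
        int i = k / INNER;   /* outer index: costs a division per iteration */
        int j = k % INNER;   /* inner index: costs a modulo per iteration   */
        out[i][j] = i + j;
    }
}

/* The same loop nest written with the collapse clause. */
void fill_collapsed( int out[OUTER][INNER] )
{
    #pragma omp parallel for collapse( 2 )
    for ( int i = 0; i < OUTER; i++ )
        for ( int j = 0; j < INNER; j++ )
            out[i][j] = i + j;
}
```

In the flattened version, the index arithmetic is work that the original loop nest did not need to do. When the loop body is small, that kind of cost can outweigh the benefit of using more threads, which is why timing both variants is worthwhile.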