Chapter: Multicore Application Programming For Windows, Linux, and Oracle Solaris : Using Automatic Parallelization and OpenMP

Improving Work Distribution Using Scheduling

The default scheduling for a parallel for loop is called static scheduling.

Improving Work Distribution Using Scheduling

The default scheduling for a parallel for loop is called static scheduling. The iterations are divided evenly, in chunks of consecutive iterations, between the threads. If there are two threads working to complete 100 iterations, the first thread will complete the first 50 iterations, and the second thread will complete the second 50 iterations. This scheme works well in situations where each iteration takes the same amount of time. However, in some cases, a different amount of work will be performed in each iteration. Con-sequently, both threads may complete the same number of iterations, but one may have more work to do in those iterations. Listing 7.31 shows an example of this. The number of iterations performed in the routine calc() depends on the value passed into it. The value passed into the routine largely depends on the value of the loop induction variable i. With static scheduling, threads that get assigned the higher values of i will take longer to complete their work.

Listing 7.31 Code Where Each Iteration Has Different Computational Costs

double calc( int count )

{

double d = 1.0;

for( int i=0; i < count*count; i++ )

{

d += d;

}

return d;

}

int main()

{

double data[200][100]; int i,j;

#pragma omp parallel for private(i,j) shared(data) for ( int i=0; i<200; i++ )

{

for ( int j=0; j<200; j++ )

{

data[i][j] = calc( i+j );

}

return 0;

}

Listing 7.32 shows the results of compiling and running this code using one and two threads.

Listing 7.32 Compiling and Running Code with One and Two Threads

$ cc -O -xopenmp -xloopinfo schedule.c

"schedule.c", line 4: not parallelized, unsafe dependence "schedule.c", line 16: PARALLELIZED, user pragma used "schedule.c", line 18: not parallelized, loop inside OpenMP region $ export OMP_NUM_THREADS=1

$ timex a.out

real 4.96

user 4.94

sys 0.01

$ export OMP_NUM_THREADS=2

$ timex a.out

real user sys

In this case, going from one to two threads decreases the runtime from about 5 sec-onds to about 3.5 seconds. This is less than linear scaling. Ideally, doubling the thread count should halve the runtime. The reason for this poor scaling is that the work is unevenly distributed between the two threads. The thread that computes the results for the lower values of i will have fewer iterations to complete in the innermost loop than the thread that computes the higher values of i.

We can resolve this by changing the scheduling of the loop. Instead of having a static schedule, we can use a dynamic schedule. A dynamic schedule means that the work is divided into multiple chunks of work. As each thread completes a chunk of work, it takes the next chunk of work. This ensures that a thread that completes its work faster ends up doing more chunks, whereas a thread that takes a lot of time to complete each chunk of work will end up completing fewer of them. Dynamic scheduling is selected using the schedule(dynamic) clause. Listing 7.33 shows the modified code.

Listing 7.33 Code Where Each Iteration Has Different Computational Costs

double calc( int count )

{

double d = 1.0;

for( int i=0; i < count*count; i++ )

{

d += d;

}

return d;

}

int main()

{

double data[200][100]; int i, j;

#pragma omp parallel for private(i,j) shared(data) schedule(dynamic) for ( int i=0; i<200; i++ )

{

for ( int j=0; j<200; j++ )

{

data[i][j] = calc(i+j);

}

return 0;

}

Running this modified code on the same platform results in a runtime of 2.5 seconds— half the original single-threaded runtime.

Dynamic scheduling avoids the issue of distributing work evenly across the threads. However, it also incurs greater runtime overheads. This is not so apparent in this small example. Rather than use another code sequence to demonstrate it, it is relatively simple to explain the reason for this increase in overhead.

With static scheduling, the threads get their iteration limits when they start, and once completed, they can wait at a barrier until all other threads have completed. There is no synchronization between threads in the parallel region.

In contrast, dynamic scheduling requires that each thread complete a unit of work that is much shorter than their share of the total iteration count. Every time a thread completes this short chunk of work, it has to fetch the next chunk. Every fetch of a chunk of work is a potential serialization point, because all the threads have to cooperate to determine who gets the next chunk. So, the increase in overhead comes from two potential factors: the number of places where a new chunk of work is fetched and the interthread communication costs when each new chunk of work is fetched.

The default size of the chunks for a dynamic schedule is one. Each thread performs a single iteration before returning for more work. This can be too low a value, resulting in significant synchronization overhead. An additional parameter can be given to the sched-ule clause to govern the chunk size used. This parameter can be a static value or can be calculated at runtime. Listing 7.34 shows an example of using dynamic scheduling with a chunk size value calculated at runtime.

Listing 7.34 Dynamic Scheduling with Chunk Size Calculated at Runtime

double sum( double *a, int n )

{

double total = 0.0;

#pragma omp parallel for reduction( +: total) schedule( dynamic, n/50) for ( int i=0; i<n; i++ )

{

total += a[i];

}

return total;

}

Another scheduling mode is guided. With guided scheduling, the size of the chunk of work assigned is proportional to the amount of work remaining. So, initially the threads will get assigned large chunks of work, but then they will get smaller chunks until all the work is completed. Guided scheduling can also take an optional parameter that deter-mines the smallest chunk size to be used. The default minimum chunk size is a single iteration. Listing 7.35 shows an example of guided scheduling.

Listing 7.35 Guided Scheduling with Chunk Size Calculated at Runtime

double sum( double *a, int n )

{

double total = 0.0;

#pragma omp parallel for reduction( +: total) schedule( guided, n/50) for ( int i=0; i<n; i++ )

{

total += a[i];

}

return total;

}

There are two more scheduling modes: automatic and runtime. The schedule(auto) clause will leave the scheduling decisions for the runtime system to determine automatically. The schedule(runtime) clause enables the environment variable OMP_SCHEDULE to determine the type of schedule used.

Static scheduling can also take an optional chunk size parameter. If a chunk size is specified for static scheduling, the work is split into equal units of the specified chunk size. These are distributed round-robin to the worker threads. This may mean that some worker threads have no work assigned to them or that some threads end up with more chunks of work than others. In the absence of a specified chunk size, the work is divided evenly over all the threads.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Multicore Application Programming For Windows, Linux, and Oracle Solaris : Using Automatic Parallelization and OpenMP : Improving Work Distribution Using Scheduling |