Improving Work Distribution Using Scheduling
The default scheduling for a parallel for loop is static scheduling. The iterations are divided evenly between the threads in chunks of consecutive iterations. If two threads are working to complete 100 iterations, the first thread completes the first 50 iterations, and the second thread completes the second 50. This scheme works well when each iteration takes the same amount of time. In some cases, however, a different amount of work is performed in each iteration. Consequently, both threads may complete the same number of iterations, but one may have more work to do in those iterations. Listing 7.31 shows an example of this. The number of iterations performed in the routine calc() depends on the value passed into it, and that value largely depends on the value of the loop induction variable i. With static scheduling, the threads assigned the higher values of i will take longer to complete their work.
Listing 7.31 Code Where Each Iteration Has Different Computational Costs
double calc( int count )
{
  double d = 1.0;
  /* The cost of this routine grows with count*count */
  for ( int i = 0; i < count*count; i++ )
  {
    d += d;
  }
  return d;
}

int main()
{
  double data[200][100];
  int i, j;
  #pragma omp parallel for private(i,j) shared(data)
  for ( i = 0; i < 200; i++ )
  {
    for ( j = 0; j < 100; j++ )
    {
      data[i][j] = calc( i+j );
    }
  }
  return 0;
}
Listing 7.32 shows the results of compiling and running this code using one and two threads.
Listing 7.32 Compiling and Running Code with One and Two Threads
$ cc -O -xopenmp -xloopinfo schedule.c
"schedule.c", line 4: not parallelized, unsafe dependence
"schedule.c", line 16: PARALLELIZED, user pragma used
"schedule.c", line 18: not parallelized, loop inside OpenMP region
$ export OMP_NUM_THREADS=1
$ timex a.out

real        4.96
user        4.94
sys         0.01

$ export OMP_NUM_THREADS=2
$ timex a.out

real
user
sys
In this case, going from one to two threads decreases the runtime from about 5 seconds to about 3.5 seconds. This is less than linear scaling; ideally, doubling the thread count should halve the runtime. The reason for this poor scaling is that the work is unevenly distributed between the two threads. The thread that computes the results for the lower values of i has fewer iterations to complete in the innermost loop, inside calc(), than the thread that computes the results for the higher values of i.
We can resolve this by changing the scheduling of the loop. Instead of a static schedule, we can use a dynamic schedule. With a dynamic schedule, the work is divided into multiple chunks, and as each thread completes a chunk, it takes the next one. This ensures that a thread that finishes its chunks quickly ends up completing more of them, whereas a thread that takes a long time over each chunk completes fewer. Dynamic scheduling is selected using the schedule(dynamic) clause. Listing 7.33 shows the modified code.
Listing 7.33 Code Where Each Iteration Has Different Computational Costs, Using Dynamic Scheduling
double calc( int count )
{
  double d = 1.0;
  for ( int i = 0; i < count*count; i++ )
  {
    d += d;
  }
  return d;
}

int main()
{
  double data[200][100];
  int i, j;
  #pragma omp parallel for private(i,j) shared(data) schedule(dynamic)
  for ( i = 0; i < 200; i++ )
  {
    for ( j = 0; j < 100; j++ )
    {
      data[i][j] = calc( i+j );
    }
  }
  return 0;
}
Running this modified code on the same platform results in a runtime of 2.5 seconds, half the original single-threaded runtime.
Dynamic scheduling solves the problem of uneven work distribution across the threads. However, it also incurs greater runtime overhead. This is not very apparent in this small example, and rather than demonstrate it with another code sequence, it is simpler to explain the reason for the increase in overhead.
With static scheduling, the threads receive their iteration limits when they start; once they have completed those iterations, they can wait at a barrier until all the other threads have completed. No synchronization between the threads is needed inside the parallel region.
In contrast, dynamic scheduling requires each thread to complete a unit of work that is much shorter than its share of the total iteration count. Every time a thread completes this short chunk of work, it has to fetch the next chunk. Every fetch of a chunk of work is a potential serialization point, because all the threads have to cooperate to determine which of them gets the next chunk. So, the increase in overhead comes from two factors: the number of times a new chunk of work must be fetched and the interthread communication cost incurred by each fetch.
The default chunk size for a dynamic schedule is one: each thread performs a single iteration before returning for more work. This can be too low a value, resulting in significant synchronization overhead. An additional parameter can be given to the schedule clause to govern the chunk size used. This parameter can be a constant or an expression calculated at runtime. Listing 7.34 shows an example of using dynamic scheduling with a chunk size calculated at runtime.
Listing 7.34 Dynamic Scheduling with Chunk Size Calculated at Runtime
double sum( double *a, int n )
{
  double total = 0.0;
  #pragma omp parallel for reduction( +: total ) schedule( dynamic, n/50 )
  for ( int i = 0; i < n; i++ )
  {
    total += a[i];
  }
  return total;
}
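Note that the chunk size must evaluate to a positive integer, so an expression such as n/50 implicitly assumes that n is at least 50.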
Another scheduling mode is guided. With guided scheduling, the size of the chunk of work assigned is proportional to the amount of work remaining. Initially the threads are assigned large chunks of work, and then they receive progressively smaller chunks until all the work is completed. Guided scheduling can also take an optional parameter that determines the smallest chunk size to be used. The default minimum chunk size is a single iteration. Listing 7.35 shows an example of guided scheduling.
Listing 7.35 Guided Scheduling with Chunk Size Calculated at Runtime
double sum( double *a, int n )
{
  double total = 0.0;
  #pragma omp parallel for reduction( +: total ) schedule( guided, n/50 )
  for ( int i = 0; i < n; i++ )
  {
    total += a[i];
  }
  return total;
}
There are two more scheduling modes: auto and runtime. The schedule(auto) clause leaves the scheduling decision to the runtime system to determine automatically. The schedule(runtime) clause causes the environment variable OMP_SCHEDULE to determine the type of schedule used.
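As a sketch of how this might look, the reduction loop from Listing 7.34 could defer the scheduling decision to runtime as follows; the schedule can then be changed on each run through OMP_SCHEDULE, without recompiling:

double sum( double *a, int n )
{
  double total = 0.0;
  /* Schedule kind and chunk size are taken from OMP_SCHEDULE at runtime */
  #pragma omp parallel for reduction( +: total ) schedule( runtime )
  for ( int i = 0; i < n; i++ )
  {
    total += a[i];
  }
  return total;
}

$ export OMP_SCHEDULE="guided,10"
$ a.out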
Static scheduling can also take an optional chunk size parameter. If a chunk size is specified, the work is split into equal units of that size, and these are distributed round-robin to the worker threads. This may mean that some worker threads have no work assigned to them or that some threads end up with more chunks of work than others. In the absence of a specified chunk size, the work is divided evenly over all the threads. An example follows.
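For example, here is a minimal sketch of the same reduction loop using a static schedule with a fixed chunk size of ten iterations:

double sum( double *a, int n )
{
  double total = 0.0;
  /* Iterations are split into chunks of 10 and dealt out round-robin */
  #pragma omp parallel for reduction( +: total ) schedule( static, 10 )
  for ( int i = 0; i < n; i++ )
  {
    total += a[i];
  }
  return total;
}

With two threads and n equal to 100, the first thread would execute iterations 0 to 9, 20 to 29, and so on, while the second thread takes the interleaved chunks.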