Evaluating the OpenMP codes
Before we can compare the basic and the reduced codes, we need to decide how to schedule the parallelized for loops. For the basic code, we’ve seen that any schedule that divides the iterations equally among the threads should do a good job of balancing the computational load. (As usual, we’re assuming no more than one thread/core.) We also observed that a block partitioning of the iterations would result in fewer cache misses than a cyclic partition. Thus, we would expect that a block schedule would be the best option for the basic version.
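In OpenMP, a block schedule can be requested explicitly with the schedule(static, chunk) clause, using a chunk size of n/thread_count. The following is a minimal sketch of the basic solver’s forces loop under such a schedule; the names vect_t, pos, masses, G, and Compute_forces_block are our assumptions about the surrounding code rather than the original listings, and we assume n is evenly divisible by thread_count.

   #include <math.h>
   #include <omp.h>

   #define X 0
   #define Y 1
   typedef double vect_t[2];     /* hypothetical 2-D vector type */
   const double G = 6.673e-11;

   /* Basic solver: every iteration q does the same amount of work
    * (a full pass over all n particles), so a block partition of
    * the iterations balances the load while keeping each thread's
    * reads and writes in a contiguous region of memory. */
   void Compute_forces_block(vect_t forces[], vect_t pos[],
         double masses[], int n, int thread_count) {
   #  pragma omp parallel for num_threads(thread_count) \
            schedule(static, n/thread_count)
      for (int q = 0; q < n; q++) {
         forces[q][X] = forces[q][Y] = 0.0;
         for (int k = 0; k < n; k++) {
            if (k == q) continue;
            double x_diff = pos[q][X] - pos[k][X];
            double y_diff = pos[q][Y] - pos[k][Y];
            double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            double dist_cubed = dist*dist*dist;
            forces[q][X] -= G*masses[q]*masses[k]/dist_cubed*x_diff;
            forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed*y_diff;
         }
      }
   }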
In the reduced code, the amount of work done in the first phase of the computation of the forces decreases as the for loop proceeds. We’ve seen that a cyclic schedule should do a better job of assigning more or less equal amounts of work to each thread. In the remaining parallel for loops—the initialization of the loc_forces array, the second phase of the computation of the forces, and the updating of the positions and velocities—the work required is roughly the same for all the iterations. Therefore, taken out of context, each of these loops will probably perform best with a block schedule. However, the schedule of one loop can affect the performance of another (see Exercise 6.10), so it may be that choosing a cyclic schedule for one loop and block schedules for the others will degrade performance.
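A cyclic schedule is obtained in OpenMP with schedule(static, 1). Here is a sketch of how it might be applied to the first phase of the reduced solver’s forces computation. Again the names are our assumptions; we take loc_forces to be a flattened thread_count × n array of per-thread partial forces, zeroed before this phase.

   #include <math.h>
   #include <omp.h>

   #define X 0
   #define Y 1
   typedef double vect_t[2];
   const double G = 6.673e-11;

   /* Reduced solver, phase 1: iteration q does n-q-1 units of
    * work, so schedule(static, 1) deals the iterations out
    * round-robin to balance the load.  Thread t writes only row t
    * of loc_forces, so there are no races. */
   void Compute_forces_phase1(vect_t loc_forces[], vect_t pos[],
         double masses[], int n, int thread_count) {
   #  pragma omp parallel num_threads(thread_count)
      {
         int my_rank = omp_get_thread_num();
   #     pragma omp for schedule(static, 1)
         for (int q = 0; q < n; q++)
            for (int k = q+1; k < n; k++) {
               double x_diff = pos[q][X] - pos[k][X];
               double y_diff = pos[q][Y] - pos[k][Y];
               double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
               double dist_cubed = dist*dist*dist;
               double fqk_x = -G*masses[q]*masses[k]/dist_cubed*x_diff;
               double fqk_y = -G*masses[q]*masses[k]/dist_cubed*y_diff;
               /* Force on q due to k, and the equal and opposite
                * force on k due to q (Newton's third law). */
               loc_forces[my_rank*n + q][X] += fqk_x;
               loc_forces[my_rank*n + q][Y] += fqk_y;
               loc_forces[my_rank*n + k][X] -= fqk_x;
               loc_forces[my_rank*n + k][Y] -= fqk_y;
            }
      }
   }

The second phase would then sum the thread_count rows of loc_forces into a single forces array; since every particle’s sum costs the same, that loop is a natural candidate for a block schedule.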
With these choices, Table 6.3 shows the
performance of the n-body solvers when they’re run on one of our systems with no I/O.
The solvers were run with 400 particles for 1000 timesteps. The column labeled “Default
Sched” gives times for the OpenMP reduced solver when all of the inner loops
use the default schedule, which, on our system, is a block schedule. The column
labeled “Forces Cyclic” gives times when the first phase of the forces
computation uses a cyclic schedule and the other inner loops use the default
schedule. The last column, labeled “All Cyclic,” gives times when all of
the inner loops use a cyclic schedule. The
run-times of the serial solvers differ from those of the single-threaded
solvers by less than 1%, so we’ve omitted them from the table.
Notice that with more than one thread the reduced solver, using all default schedules, takes anywhere from 50 to 75% longer than the reduced solver with the cyclic forces computation. Using the cyclic schedule is clearly superior to the default schedule in this case, and any loss in time resulting from cache issues is more than made up for by the improved load balance for the computations.
For only two threads there is very little difference between the performance of the reduced solver with only the first forces loop cyclic and the reduced solver with all loops cyclic. However, as we increase the number of threads, the performance of the reduced solver that uses a cyclic schedule for all of the loops does start to degrade. In this particular case, when there are more threads, it appears that the overhead of redistributing the data among the threads’ caches between differently scheduled loops is less than the overhead incurred from the false sharing induced by using a cyclic schedule everywhere.
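To see where the false sharing comes from, consider a loop that writes one array element per iteration, such as the initialization of the forces, under a cyclic schedule. A sketch, assuming (hypothetically) 16-byte force vectors and 64-byte cache lines:

   #include <omp.h>

   #define X 0
   #define Y 1
   typedef double vect_t[2];

   /* With schedule(static, 1), iteration q runs on thread
    * q % thread_count, so adjacent elements of forces[] are
    * written by different threads.  A 64-byte cache line holds
    * four 16-byte vect_t's, so each write invalidates the copies
    * of that line cached by the other threads, even though no two
    * threads ever touch the same element. */
   void Init_forces_cyclic(vect_t forces[], int n, int thread_count) {
   #  pragma omp parallel for num_threads(thread_count) \
            schedule(static, 1)
      for (int q = 0; q < n; q++) {
         forces[q][X] = 0.0;
         forces[q][Y] = 0.0;
      }
   }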
Finally, notice that the basic solver takes about twice as long as the reduced solver with the cyclic scheduling of the forces computation. So if the extra memory is available, the reduced solver is clearly superior. However, the reduced solver increases the memory requirement for the storage of the forces by a factor of thread_count, so for very large numbers of particles, it may be impossible to use the reduced solver.
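The factor-of-thread_count increase is easiest to see in the allocations. A hypothetical sketch (the names and the flattened layout are our assumptions):

   #include <stdlib.h>

   typedef double vect_t[2];

   /* Basic solver: one force vector per particle. */
   vect_t* Alloc_forces(int n) {
      return malloc(n * sizeof(vect_t));
   }

   /* Reduced solver: one partial force vector per particle per
    * thread -- thread_count times the storage of the basic
    * solver.  With, say, 16 threads and 10 million particles,
    * that is 2.56 GB instead of 160 MB. */
   vect_t* Alloc_loc_forces(int n, int thread_count) {
      return malloc((size_t)thread_count * n * sizeof(vect_t));
   }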