Parallelizing tree search
Let’s take a look at parallelizing tree search.
The tree structure suggests that we identify tasks with tree nodes. If we do
this, the tasks will communicate down the tree edges: a parent will communicate
a new partial tour to a child, but a child, except for terminating, doesn’t
communicate directly with a parent.
We also need to take into consideration the
updating and use of the best tour. Each task examines the best tour to
determine whether the current partial tour is feasible or the current complete
tour has lower cost. If a leaf task determines that its complete tour is better than the current best tour, then
it will also update the best tour. Although all of the actual computation can
be considered to be carried out by the tree
node tasks, we need to keep in mind that the best tour data structure requires
additional communication that is not explicit in the tree edges. Thus, it’s
convenient to add an additional task that corresponds to the best tour. It
“sends” data to every tree node task, and receives data from some of the
leaves. This latter view is convenient for shared-memory, but not so convenient
for distributed-memory.
A natural way to agglomerate and map the tasks
is to assign a subtree to each thread or process, and have each thread/process
carry out all the tasks in its subtree. For example, if we have three threads
or processes, as shown earlier in Figure 6.10, we might map the subtree rooted
at 0 → 1 to thread/process 0, the subtree rooted at 0 → 2 to thread/process 1, and the subtree rooted at 0 → 3 to thread/process 2.
Mapping details
There are many possible algorithms for
identifying which subtrees we assign to the processes or threads. For example,
one thread or process could run the last version of serial depth-first search
until the stack stores one partial tour for each thread or process. Then it
could assign one tour to each thread or process. The problem with depth-first
search is that we expect a subtree whose root is deeper in the tree to require
less work than a subtree whose root is higher up in the tree, so we would
probably get better load balance if we used something like breadth-first search to identify the subtrees.
As the name suggests, breadth-first search
searches as widely as possible in the tree before going deeper. So if, for
example, we carry out a breadth-first search until we reach a level of the tree
that has at least thread_count or comm_sz nodes, we can then divide the nodes at this
level among the threads or processes. See Exercise 6.18 for implementation
details.
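As a concrete illustration, the following C sketch shows one way the breadth-first partitioning might look. It is not the code developed in the text: the tour_t record, the queue type, and the Expand helper are hypothetical stand-ins, and the loop simply stops as soon as the queue holds at least thread_count partial tours; tour i can then be assigned to thread (or, in the MPI version, process) i % thread_count.

#define MAX_CITIES 100
#define MAX_QUEUE  10000

typedef struct {                 /* partial tour record                 */
   int cities[MAX_CITIES];       /* cities visited so far               */
   int count;                    /* number of cities in the tour        */
   int cost;                     /* cost of the partial tour            */
} tour_t;

typedef struct {                 /* simple FIFO queue of partial tours  */
   tour_t list[MAX_QUEUE];       /* no wraparound or bounds checking    */
   int head, tail;
} queue_t;

/* Append every child of tour (the tour extended by one more unvisited
 * city) to the tail of the queue.  Updating the child's cost would use
 * the digraph; that detail is omitted here.                            */
void Expand(queue_t* queue, tour_t tour, int n) {
   for (int city = 1; city < n; city++) {
      int visited = 0;
      for (int i = 0; i < tour.count; i++)
         if (tour.cities[i] == city) visited = 1;
      if (!visited) {
         tour_t child = tour;
         child.cities[child.count++] = city;
         queue->list[queue->tail++] = child;
      }
   }
}  /* Expand */

/* Breadth-first search until the queue holds at least thread_count
 * partial tours.  Tour i is then assigned to thread i % thread_count
 * (or, in the MPI version, to process i % comm_sz).                    */
void Partition_tree(queue_t* queue, int n, int thread_count) {
   while (queue->tail - queue->head < thread_count) {
      tour_t tour = queue->list[queue->head++];   /* dequeue front tour */
      Expand(queue, tour, n);                     /* enqueue children   */
   }
}  /* Partition_tree */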
The best tour data structure
On a shared-memory system, the best tour data
structure can be shared. In this setting, the Feasible function can simply examine the data
structure. However, updates to the best tour will cause a race condition, and
we’ll need some sort of locking to prevent errors. We’ll discuss this in more
detail when we implement the parallel version.
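To make the issue concrete, here is a minimal Pthreads sketch of one way the update might be protected. The names best_tour_mutex, Tour_cost, and Copy_tour are illustrative rather than the book's actual code, and the tour_t record from the earlier sketch is reused; the point is only that the cost test is repeated inside the critical section, since another thread may have improved the best tour between the first test and the acquisition of the lock.

#include <pthread.h>

tour_t best_tour;      /* shared best tour; initialized elsewhere with a
                          very large cost.  Reuses tour_t from above.   */
pthread_mutex_t best_tour_mutex = PTHREAD_MUTEX_INITIALIZER;

int  Tour_cost(const tour_t* tour)                { return tour->cost; }
void Copy_tour(const tour_t* src, tour_t* dest)   { *dest = *src; }

/* Called by a thread that has found a complete tour.  The cost test is
 * repeated inside the critical section because another thread may have
 * updated best_tour between the first test and the lock acquisition.   */
void Update_best_tour(const tour_t* tour) {
   if (Tour_cost(tour) < Tour_cost(&best_tour)) {   /* cheap pre-test   */
      pthread_mutex_lock(&best_tour_mutex);
      if (Tour_cost(tour) < Tour_cost(&best_tour))  /* re-test w/ lock  */
         Copy_tour(tour, &best_tour);
      pthread_mutex_unlock(&best_tour_mutex);
   }
}  /* Update_best_tour */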
In the case of a distributed-memory system,
there are a couple of choices that we need to make about the best tour. The
simplest option would be to have the processes operate independently of each
other until they have completed searching their sub-trees. In this setting,
each process would store its own local best tour. This local best tour would be used by the process in Feasible and updated by the process each time it calls Update_best_tour. When all the processes have finished searching, they can perform
a global reduction to find the tour with the global least cost.
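A sketch of what this final reduction might look like in MPI is shown below; it is not the implementation developed later in the text. It reuses the hypothetical tour_t record from the earlier sketch and uses MPI_MINLOC to find both the minimum cost and the rank that owns it, after which the owning process broadcasts the winning tour.

#include <mpi.h>

/* Combine the local best tours after all processes have finished
 * searching.  On return every process's loc_best holds the global
 * best tour.                                                           */
void Find_global_best(tour_t* loc_best, int my_rank, MPI_Comm comm) {
   struct { int cost; int rank; } loc_data, glob_data;

   loc_data.cost = loc_best->cost;
   loc_data.rank = my_rank;

   /* Minimum cost together with (one of) the rank(s) that achieved it. */
   MPI_Allreduce(&loc_data, &glob_data, 1, MPI_2INT, MPI_MINLOC, comm);

   /* The owner of the cheapest tour broadcasts it to everyone else.    */
   MPI_Bcast(loc_best->cities, MAX_CITIES, MPI_INT, glob_data.rank, comm);
   MPI_Bcast(&loc_best->count, 1, MPI_INT, glob_data.rank, comm);
   loc_best->cost = glob_data.cost;
}  /* Find_global_best */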
This approach has the virtue of simplicity, but
it also suffers from the problem that it’s entirely possible for a process to
spend most or all of its time searching through partial tours that couldn’t
possibly lead to a global best tour. Thus, we should probably try using an
approach that makes the current global best tour available to all the
processes. We’ll take a look at details when we discuss the MPI implementation.
Dynamic mapping of tasks
A second issue we should consider is the
problem of load imbalance. Although the use of breadth-first search ensures
that all of our subtrees have approximately the same number of nodes, there is
no guarantee that they all have the same amount of work. It’s entirely possible
that one process or thread will have a subtree consisting of very expensive
tours, and, as a consequence, it won’t need to search very deeply into its
assigned subtree. However, with our current, static mapping of tasks to threads/processes, this one thread or process
will simply have to wait until the other threads/processes are done.
An alternative is to implement a dynamic mapping scheme. In a dynamic scheme, if one
thread/process runs out of useful work, it can obtain additional work from
another thread/process. In our final implementation of serial depth-first
search, each stack record contains a partial tour. With this data structure a
thread or process can give additional work to another thread/process by dividing
the contents of its stack. This might at first seem to have the potential for
causing problems with the program’s correctness, since if we give part of one
thread’s or one process’ stack to another, there’s a good chance that the order
in which the tree nodes will be visited will be changed.
However, we’re already going to do this; when
we assign different subtrees to different threads/processes, the order in which
the tree nodes are visited is no longer the serial depth-first ordering. In
fact, in principle, there’s no reason to visit any node before any other node
as long as we make sure we visit “ancestors” before “descendants.” But this
isn’t a problem since a partial tour isn’t added to the stack until after all
its ancestors have been visited. For example, in Figure 6.10 the node
consisting of the tour 0 → 2 → 1 will be pushed onto the stack when the node
consisting of the tour 0 → 2 is the currently active node, and
consequently the two nodes won’t be on the stack simultaneously. Similarly, the
parent of 0 → 2, the root of the tree, 0, is no longer on the stack when 0 → 2 is visited.
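To make the stack-splitting idea concrete, here is one possible Split_stack function, written as a sketch rather than the code developed later in the text. The array-based stack type and the particular heuristic (giving away every other tour so that both stacks end up with a mix of tours near the root and tours near the leaves) are illustrative choices; because each stack record holds a complete partial tour, the records can be redistributed without violating the ancestor-before-descendant rule.

/* An array-based stack of partial tours, reusing tour_t and MAX_QUEUE
 * from the earlier sketches.                                           */
typedef struct {
   tour_t list[MAX_QUEUE];
   int size;                     /* number of tours on the stack        */
} my_stack_t;

/* Move roughly half of src's tours onto the (empty) stack dest by
 * donating every other tour, counted from the bottom of the stack.
 * The tours that remain in src are compacted toward the bottom.        */
void Split_stack(my_stack_t* src, my_stack_t* dest) {
   int kept = 0;
   dest->size = 0;
   for (int i = 0; i < src->size; i++)
      if (i % 2 == 1)
         dest->list[dest->size++] = src->list[i];   /* donate           */
      else
         src->list[kept++] = src->list[i];          /* keep, compacted  */
   src->size = kept;
}  /* Split_stack */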
A second alternative for dynamic load
balancing—at least in the case of shared memory—would be to have a shared
stack. However, we couldn’t simply dispense with the local stacks. If a thread
needed to access the shared stack every time it pushed or popped, there would
be a tremendous amount of contention for the shared stack and the performance
of the program would probably be worse than a serial program. This is exactly
what happened when we parallelized the reduced n-body solver with mutexes/locks protecting the calculations of the
total forces on the various particles. If every call to Push or Pop formed a critical section, our program would
grind to nearly a complete halt. Thus, we would want to retain local stacks for
each thread, with only occasional accesses to the shared stack. We won’t pursue
this alternative. See Programming Assignment 6.7 for further details.
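As a rough, purely illustrative fragment of what “only occasional accesses to the shared stack” might look like (the full treatment is left to Programming Assignment 6.7, and all names here are hypothetical), a thread could touch the shared stack, under a mutex, only when its own local stack has run dry; a matching function that donates surplus tours to the shared stack under the same mutex is omitted.

#include <pthread.h>

my_stack_t shared_stack;                    /* reuses my_stack_t above  */
pthread_mutex_t shared_stack_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Called only when a thread's local stack is empty: try to take one
 * tour from the shared stack.  Returns 1 if work was obtained, 0 if
 * the shared stack was empty as well.                                  */
int Refill_from_shared(my_stack_t* local) {
   int got_work = 0;
   pthread_mutex_lock(&shared_stack_mutex);
   if (shared_stack.size > 0) {
      local->list[local->size++] = shared_stack.list[--shared_stack.size];
      got_work = 1;
   }
   pthread_mutex_unlock(&shared_stack_mutex);
   return got_work;
}  /* Refill_from_shared */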