Parallelizing tree search
Let’s take a look at parallelizing tree search. The tree structure suggests that we identify tasks with tree nodes. If we do this, the tasks will communicate down the tree edges: a parent will communicate a new partial tour to a child, but a child, except for terminating, doesn’t communicate directly with a parent.
We also need to take into consideration the updating and use of the best tour. Each task examines the best tour to determine whether the current partial tour is feasible or the current complete tour has lower cost. If a leaf task determines its tour is a better tour, then it will also update the best tour. Although all of the actual computation can
be considered to be carried out by the tree node tasks, we need to keep in mind that the best tour data structure requires additional communication that is not explicit in the tree edges. Thus, it’s convenient to add an additional task that corresponds to the best tour. It “sends” data to every tree node task, and receives data from some of the leaves. This latter view is convenient for shared-memory, but not so convenient for distributed-memory.
A natural way to agglomerate and map the tasks is to assign a subtree to each thread or process, and have each thread/process carry out all the tasks in its subtree. For example, if we have three threads or processes, as shown earlier in Figure 6.10, we might map the subtree rooted at 0 - > 1 to thread/process 0, the subtree rooted at 0 -> 2 to thread/process 1, and the subtree rooted at 0 -> 3 to thread/process 2.
There are many possible algorithms for identifying which subtrees we assign to the processes or threads. For example, one thread or process could run the last version of serial depth-first search until the stack stores one partial tour for each thread or process. Then it could assign one tour to each thread or process. The problem with depth-first search is that we expect a subtree whose root is deeper in the tree to require less work than a subtree whose root is higher up in the tree, so we would probably get better load balance if we used something like breadth-first search to identify the subtrees.
As the name suggests, breadth-first search searches as widely as possible in the tree before going deeper. So if, for example, we carry out a breadth-first search until we reach a level of the tree that has at least thread count or comm sz nodes, we can then divide the nodes at this level among the threads or processes. See Exercise 6.18 for implementation details.
The best tour data structure
On a shared-memory system, the best tour data structure can be shared. In this setting, the Feasible function can simply examine the data structure. However, updates to the best tour will cause a race condition, and we’ll need some sort of locking to prevent errors. We’ll discuss this in more detail when we implement the parallel version.
In the case of a distributed-memory system, there are a couple of choices that we need to make about the best tour. The simplest option would be to have the processes operate independently of each other until they have completed searching their sub-trees. In this setting, each process would store its own local best tour. This local best tour would be used by the process in Feasible and updated by the process each time it calls Update_best_tour. When all the processes have finished searching, they can perform a global reduction to find the tour with the global least cost.
This approach has the virtue of simplicity, but it also suffers from the problem that it’s entirely possible for a process to spend most or all of its time searching through partial tours that couldn’t possibly lead to a global best tour. Thus, we should probably try using an approach that makes the current global best tour available to all the processes. We’ll take a look at details when we discuss the MPI implementation.
Dynamic mapping of tasks
A second issue we should consider is the problem of load imbalance. Although the use of breadth-first search ensures that all of our subtrees have approximately the same number of nodes, there is no guarantee that they all have the same amount of work. It’s entirely possible that one process or thread will have a subtree con-sisting of very expensive tours, and, as a consequence, it won’t need to search very deeply into its assigned subtree. However, with our current, static mapping of tasks to threads/processes, this one thread or process will simply have to wait until the other threads/processes are done.
An alternative is to implement a dynamic mapping scheme. In a dynamic scheme, if one thread/process runs out of useful work, it can obtain additional work from another thread/process. In our final implementation of serial depth-first search, each stack record contains a partial tour. With this data structure a thread or process can give additional work to another thread/process by dividing the contents of its stack. This might at first seem to have the potential for causing problems with the program’s correctness, since if we give part of one thread’s or one process’ stack to another, there’s a good chance that the order in which the tree nodes will be visited will be changed.
However, we’re already going to do this; when we assign different subtrees to different threads/processes, the order in which the tree nodes are visited is no longer the serial depth-first ordering. In fact, in principle, there’s no reason to visit any node before any other node as long as we make sure we visit “ancestors” before “descen-dants.” But this isn’t a problem since a partial tour isn’t added to the stack until after all its ancestors have been visited. For example, in Figure 6.10 the node consisting of the tour 0 ! 2 ! 1 will be pushed onto the stack when the node consisting of the tour 0 ! 2 is the currently active node, and consequently the two nodes won’t be on the stack simultaneously. Similarly, the parent of 0 ! 2, the root of the tree, 0, is no longer on the stack when 0 ! 2 is visited.
A second alternative for dynamic load balancing—at least in the case of shared memory—would be to have a shared stack. However, we couldn’t simply dispense with the local stacks. If a thread needed to access the shared stack every time it pushed or popped, there would be a tremendous amount of contention for the shared stack and the performance of the program would probably be worse than a serial program. This is exactly what happened when we parallelized the reduced n-body solver with mutexes/locks protecting the calculations of the total forces on the various particles. If every call to Push or Pop formed a critical section, our program would grind to nearly a complete halt. Thus, we would want to retain local stacks for each thread, with only occasional accesses to the shared stack. We won’t pursue this alternative. See Programming Assignment 6.7 for further details.