Implementation of tree search using MPI and static partitioning
The vast majority of the code used in the
static parallelizations of tree search using Pthreads and OpenMP is taken
straight from the second implementation of serial, iterative tree search. In
fact, the only differences are in starting the threads, the initial
partitioning of the tree, and the Update_best_tour function. We might therefore
expect that an MPI implementation would also require relatively few changes to
the serial code, and this is, in fact, the case.
There is the usual problem of distributing the
input data and collecting the results. In order to construct a complete tour, a
process will need to choose an edge into each vertex and out of each vertex.
Thus, each tour will require an entry from each row and each column for each
city that’s added to the tour, so it would clearly be advantageous for each
process to have access to the entire adjacency matrix. Note that the adjacency
matrix is going to be relatively small. For example, even if we have 100
cities, it’s unlikely that the matrix will require more than 80,000 bytes of storage,
so it makes sense to simply read in the matrix on process 0 and broadcast it to
all the processes.
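As a concrete illustration, a minimal sketch of this setup might look like the following. The names n, digraph, and Read_digraph are assumptions for the sketch (the actual program's input routine and storage may differ); the matrix is assumed to be flattened into a one-dimensional array of n*n ints.

/* Sketch: process 0 reads the input, then broadcasts it to everyone.   */
/* Assumes the adjacency matrix is stored as a flattened n x n array of */
/* int edge costs; Read_digraph is a hypothetical input routine.        */
int  n;          /* number of cities       */
int* digraph;    /* n x n adjacency matrix */

if (my_rank == 0)
   scanf("%d", &n);                     /* read the number of cities    */
MPI_Bcast(&n, 1, MPI_INT, 0, comm);

digraph = malloc(n * n * sizeof(int));
if (my_rank == 0)
   Read_digraph(digraph, n);            /* hypothetical input routine   */
MPI_Bcast(digraph, n * n, MPI_INT, 0, comm);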
Once the processes have copies of the adjacency
matrix, the bulk of the tree search can proceed as it did in the Pthreads and
OpenMP implementations. The principal differences lie in
1. partitioning the tree,
2. checking and updating the best tour, and
3. after the search has terminated, making sure that process 0 has a copy of the best tour for output.

We'll discuss each of these in turn.
Partitioning the tree
In the Pthreads and OpenMP implementations,
thread 0 uses breadth-first search to search the tree until there are at least thread_count partial tours. Each thread then determines which of these initial
partial tours it should get and pushes its tours onto its local stack.
Certainly MPI process 0 can also generate a list of comm_sz partial tours. However, since memory isn’t shared, it will need to
send the initial partial tours to the appropriate process. We could do this
using a loop of sends, but distributing the initial partial tours looks an
awful lot like a call to MPI_Scatter. In fact, the only reason we can't use MPI_Scatter is that the number of initial partial tours may not be evenly
divisible by comm_sz. When this happens, process 0 won’t be sending the same number of
tours to each process, and MPI_Scatter requires that the source of the scatter send
the same number of objects to each process in the communicator.
Fortunately, there is a variant of MPI_Scatter, MPI_Scatterv, which can be used to send different
numbers of objects to different processes. First recall the syntax of MPI_Scatter:
int MPI_Scatter(
      void*        sendbuf    /* in  */,
      int          sendcount  /* in  */,
      MPI_Datatype sendtype   /* in  */,
      void*        recvbuf    /* out */,
      int          recvcount  /* in  */,
      MPI_Datatype recvtype   /* in  */,
      int          root       /* in  */,
      MPI_Comm     comm       /* in  */);
Process root sends sendcount objects of type sendtype from sendbuf to each process in comm. Each process in comm receives recvcount objects of type recvtype into recvbuf. Most of the time, sendtype and recvtype are the same, and sendcount and recvcount are also the same. In any case, it's clear that
the root process must send the same number of objects to each process.
MPI_Scatterv, on the other hand, has syntax
int MPI_Scatterv(
      void*        sendbuf        /* in  */,
      int*         sendcounts     /* in  */,
      int*         displacements  /* in  */,
      MPI_Datatype sendtype       /* in  */,
      void*        recvbuf        /* out */,
      int          recvcount      /* in  */,
      MPI_Datatype recvtype       /* in  */,
      int          root           /* in  */,
      MPI_Comm     comm           /* in  */);
The single sendcount argument in a call to MPI_Scatter is replaced by two array arguments: sendcounts and displacements. Both of these arrays contain comm_sz elements: sendcounts[q] is the number of objects of type sendtype being sent to process q. Furthermore, displacements[q] specifies the start of the block that is being
sent to process q. The displacement is calculated in units of type sendtype. So, for example, if sendtype is MPI_INT and sendbuf has type int*, then the data that is sent to process q will begin in location
sendbuf + displacements[q]
In general, displacements[q] specifies the offset into sendbuf of the data that will go to process q. The “units” are measured in blocks with extent equal to the
extent of sendtype.
Similarly, MPI_Gatherv generalizes MPI_Gather:
int MPI_Gatherv(
void* sendbuf /* in */,
int sendcount /* in */,
MPI_Datatype sendtype /* in */,
void* recvbuf /* out */,
int* recvcounts /* in */,
int* displacements /* in */,
MPI_Datatype recvtype /* in */,
int root /* in */,
MPI_Comm comm /* in */);
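As an illustration of how the counts and displacements might be computed, here is a sketch of the uneven distribution of the initial partial tours. The names total_tours, TOUR_INTS (the number of ints each packed tour occupies), tour_list, and my_tours are assumptions for the sketch, and every process is assumed to already know total_tours (for example, from an earlier broadcast).

/* Sketch: scatter total_tours initial partial tours, each packed into   */
/* TOUR_INTS ints, from process 0 to the comm_sz processes.              */
int* sendcounts    = malloc(comm_sz * sizeof(int));
int* displacements = malloc(comm_sz * sizeof(int));
int  q, offset = 0;

for (q = 0; q < comm_sz; q++) {
   /* The first total_tours % comm_sz processes each get one extra tour  */
   int tours_for_q = total_tours / comm_sz +
                     (q < total_tours % comm_sz ? 1 : 0);
   sendcounts[q]    = tours_for_q * TOUR_INTS;
   displacements[q] = offset;
   offset += sendcounts[q];
}

/* tour_list is only significant on process 0; it holds the packed tours */
int* my_tours = malloc(sendcounts[my_rank] * sizeof(int));
MPI_Scatterv(tour_list, sendcounts, displacements, MPI_INT,
      my_tours, sendcounts[my_rank], MPI_INT, 0, comm);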
Maintaining the best tour
As we observed in our earlier discussion of
parallelizing tree search, having each process use its own best tour is likely
to result in a lot of wasted computation since the best tour on one process may
be much more costly than most of the tours on another process (see Exercise
6.21). Therefore, when a process finds a new best tour, it should send it to
the other processes.
First note that when a process finds a new best
tour, it really only needs to send its cost to the other processes. Each process only makes use of the cost of
the current best tour when it calls Best_tour. Also, when a process updates the best tour,
it doesn’t care what the actual cities on the former best tour were; it only
cares that the cost of the former best tour is greater than the cost of the new
best tour.
During the tree search, when one process wants
to communicate a new best cost to the other processes, it’s important to recognize
that we can't use MPI_Bcast, for recall that MPI_Bcast is blocking and every process in the communicator must call MPI_Bcast. However, in parallel tree search the only process that will know
that a broadcast should be executed is the process that has found a new best
cost. If it tries to use MPI_Bcast, it will probably block in the call and never
return, since it will be the only process that calls it. We therefore need to
arrange that the new tour is sent in such a way that the sending process won’t
block indefinitely.
MPI provides several options. The simplest is
to have the process that finds a new best cost use MPI_Send to send it to all the other processes:
for (dest = 0; dest < comm_sz; dest++)
   if (dest != my_rank)
      MPI_Send(&new_best_cost, 1, MPI_INT, dest, NEW_COST_TAG, comm);
Here, we're using a special tag defined in our program, NEW_COST_TAG. This will tell the receiving process that the message is a new cost, as opposed to some other type of message, for example, a tour.

The other processes must periodically check for the arrival of new best tour costs, but they can't simply call MPI_Recv to check: if a process calls MPI_Recv, the process will block until a matching message arrives. If no message arrives, for example, if no process finds a new best cost, the process will hang. Fortunately, MPI provides a function that only checks to see whether a message is available; it doesn't actually try to receive a message. It's called MPI_Iprobe, and its syntax is
int MPI_Iprobe(
      int         source       /* in  */,
      int         tag          /* in  */,
      MPI_Comm    comm         /* in  */,
      int*        msg_avail_p  /* out */,
      MPI_Status* status_p     /* out */);
It checks to see if a message from process rank
source in communicator comm and with tag tag is available. If such a message is available, *msg_avail_p will be assigned the value true and the members of *status_p will be assigned the appropriate values. For example, status_p->MPI_SOURCE will be assigned the rank of the source of the message that's been
received. If no message is available, *msg_avail_p will be assigned the value false. The source and tag arguments can be the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG, respectively. So, to check for a message with
a new cost from any process, we can call
MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail,
&status);
Program 6.9: MPI code to check for new best tour costs

MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
while (msg_avail) {
   MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE,
         NEW_COST_TAG, comm, MPI_STATUS_IGNORE);
   if (received_cost < best_tour_cost)
      best_tour_cost = received_cost;
   MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
}  /* while */
If msg_avail is true, then we can receive the new cost with
a call to MPI_Recv:
MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE,
      NEW_COST_TAG, comm, MPI_STATUS_IGNORE);
A natural place to do this is in the Best_tour function. Before checking whether our new tour is the best tour,
we can check for new tour costs from other processes with the code in Program
6.9.
This code will continue to receive messages
with new costs as long as they’re available. Each time a new cost is received
that’s better than the current best cost, the variable best_tour_cost will be updated.
Did you spot the potential problem with this
scheme? If there is no buffering available for the sender, then the loop of
calls to MPI_Send can cause the sending process to block until a matching receive
is posted. If all the other processes have completed their searches, the
sending process will hang. The loop of calls to MPI_Send is therefore unsafe.
There are a couple of alternatives provided by
MPI: buffered
sends and non-blocking sends. We’ll discuss buffered sends here. See
Exercise 6.22 for a discussion of nonblocking operations in MPI.
Modes and Buffered Sends
MPI provides four modes for sends: standard, synchronous, ready, and buffered. The various modes specify different semantics for the sending
functions. The send that we first learned about, MPI_Send, is the standard mode send. With it, the MPI
implementation can decide whether to copy the contents of the message into its
own storage or to block until a matching receive is posted. Recall that in
synchronous mode, the send will block until a matching receive is posted. In
ready mode, the send is erroneous unless a matching receive is posted before the send is started. In buffered mode, the MPI
implementation must copy the message into local temporary storage if a matching
receive hasn’t been posted. The local temporary storage must be provided by the
user program, not the MPI implementation.
Each mode has a different function: MPI_Send, MPI_Ssend, MPI_Rsend, and MPI_Bsend, respectively, but the argument lists are
identical to the argument list for MPI_Send:
int MPI_Xsend(
      void*        message       /* in */,
      int          message_size  /* in */,
      MPI_Datatype message_type  /* in */,
      int          dest          /* in */,
      int          tag           /* in */,
      MPI_Comm     comm          /* in */);
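Once a buffer has been attached (see below), the earlier loop of standard sends can be made safe simply by substituting MPI_Bsend, which copies the message into the attached buffer and returns. A sketch:

/* Sketch: "broadcast" a new best cost with buffered sends, so the   */
/* sending process doesn't block waiting for matching receives.      */
for (dest = 0; dest < comm_sz; dest++)
   if (dest != my_rank)
      MPI_Bsend(&new_best_cost, 1, MPI_INT, dest, NEW_COST_TAG, comm);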
The buffer that’s used by MPI_Bsend must be
turned over to the MPI implementation with a call to MPI_Buffer_attach:
int MPI_Buffer_attach(
      void* buffer       /* in */,
      int   buffer_size  /* in */);
The buffer argument is a pointer to a block of
memory allocated by the user program, and buffer_size is its size in bytes. A
previously “attached” buffer can be reclaimed by the program with a call to
int MPI_Buffer_detach(
      void* buf_p       /* out */,
      int*  buf_size_p  /* out */);
The *buf_p argument returns the address of the
block of memory that was previously attached, and *buf_size_p gives its size in
bytes. A call to MPI_Buffer_detach will block until all messages that have been
stored in the buffer are transmitted. Note that since buf_p is an output
argument, it should probably be passed in with the ampersand operator. For
example:
char buffer[1000];
char* buf;
int buf_size;
. . .
MPI_Buffer_attach(buffer, 1000);
. . .
/* Calls to MPI_Bsend */
. . .
MPI_Buffer_detach(&buf, &buf_size);
At any point in the program, only one user-provided buffer can be attached, so if there may be multiple buffered sends that haven't been completed, we need to estimate the amount of data that will be buffered. Of course, we can't know this with any certainty, but we do know that in any "broadcast" of a best tour, the process doing the broadcast will make comm_sz - 1 calls to MPI_Bsend, and each of these calls will send a single int. We can thus determine the size of the buffer needed for a single broadcast. The amount of storage that's needed for the data that's transmitted can be determined with a call to MPI_Pack_size:
int MPI_Pack_size(
      int          count     /* in  */,
      MPI_Datatype datatype  /* in  */,
      MPI_Comm     comm      /* in  */,
      int*         size_p    /* out */);
The output argument gives an upper bound on the
number of bytes needed to store the data in a message. This won’t be enough,
however. Recall that in addition to the data, a message stores information such
as the destination, the tag, and the communicator, so for each message there
is some additional overhead. An upper bound on this additional overhead is
given by the MPI constant MPI_BSEND_OVERHEAD. For a single broadcast, the
following code determines the amount of storage needed:
int data_size;
int message_size;
int bcast_buf_size;

MPI_Pack_size(1, MPI_INT, comm, &data_size);
message_size   = data_size + MPI_BSEND_OVERHEAD;
bcast_buf_size = (comm_sz - 1) * message_size;
We should guess a generous upper bound on the
number of broadcasts and multiply that by bcast_buf_size to get the size of the
buffer to attach.
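For example, if we guess that there will be no more than some fixed number of broadcasts during the entire search (the constant MAX_BCASTS below is just such a guess, not a value from the original program), the buffer setup and teardown might look like this:

/* Sketch: attach a buffer large enough for an estimated number of    */
/* best-cost broadcasts, and reclaim it after the search completes.   */
#define MAX_BCASTS 1000   /* guessed upper bound on broadcasts        */

char* bsend_buf;
int   bsend_buf_size = MAX_BCASTS * bcast_buf_size;

bsend_buf = malloc(bsend_buf_size);
MPI_Buffer_attach(bsend_buf, bsend_buf_size);

/* ... tree search, with calls to MPI_Bsend ... */

MPI_Buffer_detach(&bsend_buf, &bsend_buf_size);
free(bsend_buf);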
Printing the best tour
When the program finishes, we’ll want to print
out the actual tour as well as its cost, so we do need to get the tour to
process 0. It might at first seem that we could arrange this by having each
process store its local best tour—the best tour that it finds—and when the tree
search has completed, each process can check its local best tour cost and
compare it to the global best tour cost. If they’re the same, the process could
send its local best tour to process 0. There are, however, several problems
with this. First, it’s entirely possible that there are multiple “best” tours
in the TSP digraph, tours that all have the same cost, and different processes
may find these different tours. If this happens, multiple processes will try to
send their best tours to process 0, and all but one of the processes could hang in a call to MPI_Send. A second problem is that it's possible that one or more
processes never received the best tour cost, and they may try to send a tour
that isn’t optimal.
We can avoid these problems by
having each process store its local best tour, but after all the processes have
completed their searches, they can all call MPI_Allreduce,
and the process with the global best tour can then send it to process 0 for
output. The following pseudocode provides some details:
struct {
   int cost;
   int rank;
} loc_data, global_data;

loc_data.cost = Tour_cost(loc_best_tour);
loc_data.rank = my_rank;
MPI_Allreduce(&loc_data, &global_data, 1, MPI_2INT, MPI_MINLOC, comm);
if (global_data.rank == 0) return;   /* 0 already has the best tour */
if (my_rank == 0)
   Receive best tour from process global_data.rank;
else if (my_rank == global_data.rank)
   Send best tour to process 0;
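If we assume that the cities of the local best tour are available as an int array of n + 1 entries (with the hometown at both ends) and that BEST_TOUR_TAG is a tag we define for this message, then the last two lines of the pseudocode might correspond to calls such as the following; the array name loc_best_tour_cities and the tag are assumptions for this sketch, not names from the original program.

/* Sketch: get the globally best tour onto process 0 for output.        */
/* Assumes the cities of the local best tour are stored in the int      */
/* array loc_best_tour_cities (n + 1 entries), and BEST_TOUR_TAG is a   */
/* tag defined by the program.                                          */
if (my_rank == 0)
   MPI_Recv(loc_best_tour_cities, n + 1, MPI_INT, global_data.rank,
         BEST_TOUR_TAG, comm, MPI_STATUS_IGNORE);
else if (my_rank == global_data.rank)
   MPI_Send(loc_best_tour_cities, n + 1, MPI_INT, 0, BEST_TOUR_TAG, comm);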
The key here is the operation we use in the call to MPI_Allreduce. If we just used MPI_MIN, we would know what the cost of the global best tour was, but we wouldn't know who owned it. However, MPI provides a predefined operator, MPI_MINLOC, which operates on pairs of values. The first value is the value to be minimized (in our setting, the cost of the tour), and the second value is the location of the minimum (in our setting, the rank of the process that actually owns the best tour). If more than one process owns a tour with minimum cost, the location will be the lowest of the ranks of the processes that own a minimum-cost tour. The input and the output buffers in the call to MPI_Allreduce are two-member structs. Since both the cost and the rank are ints, both members are ints. Note that MPI also provides a predefined type, MPI_2INT, for this type. When the call to MPI_Allreduce returns, we have two alternatives:
- If process 0 already has the best tour, we simply return.
- Otherwise, the process owning the best tour sends it to process 0.
Unreceived messages
As we noted in the preceding discussion, it is possible that some messages won’t be received during the execution of the parallel tree search. A process may finish searching its subtree before some other process has found a best tour. This won’t cause the program to print an incorrect result; the call to MPI_Allreduce that finds the process with the best tour won’t return until every process has called it, and some process will have the best tour. Thus, it will return with the correct least-cost tour, and process 0 will receive this tour.
However, unreceived messages can cause problems with the call to MPI_Buffer_detach or the call to MPI_Finalize: a process can hang in one of these calls if it is storing buffered messages that were never received. So before we attempt to shut down MPI, we can try to receive any outstanding messages by using MPI_Iprobe. The code is very similar to the code we used to check for new best tour costs (see Program 6.9). In fact, the only messages that are not sent in collectives are the "best tour" message sent to process 0 and the best tour cost broadcasts. The MPI collectives will hang if some process fails to participate, so we only need to look for unreceived best tour costs.
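A sketch of this cleanup, following the same pattern as Program 6.9 (at this point the received costs can simply be discarded), might look like this; buf and buf_size are the variables from the earlier MPI_Buffer_attach example.

/* Sketch: drain any best-cost messages that were never received,    */
/* then detach the buffered-send buffer before calling MPI_Finalize. */
int msg_avail, received_cost;
MPI_Status status;

MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
while (msg_avail) {
   MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE,
         NEW_COST_TAG, comm, MPI_STATUS_IGNORE);
   MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
}
MPI_Buffer_detach(&buf, &buf_size);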
In the dynamically load-balanced code (which we'll discuss shortly) there are other messages, including some that are potentially quite large. To handle this situation, we can use the status argument returned by MPI_Iprobe to determine the size of the message and allocate additional storage as necessary (see Exercise 6.23).
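For instance, after a successful probe, the number of items in the pending message can be obtained from the status with MPI_Get_count before allocating a receive buffer. A minimal sketch, assuming the message consists of ints:

/* Sketch: use the status from MPI_Iprobe to size the receive buffer. */
int msg_avail, count;
MPI_Status status;

MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg_avail, &status);
if (msg_avail) {
   MPI_Get_count(&status, MPI_INT, &count);
   int* msg_buf = malloc(count * sizeof(int));
   MPI_Recv(msg_buf, count, MPI_INT, status.MPI_SOURCE,
         status.MPI_TAG, comm, MPI_STATUS_IGNORE);
   /* ... process the message ... */
   free(msg_buf);
}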