Chapter: An Introduction to Parallel Programming : Parallel Program Development

Implementation of tree search using MPI and static partitioning

The vast majority of the code used in the static parallelizations of tree search using Pthreads and OpenMP is taken straight from the second implementation of serial, iterative tree search.

Implementation of tree search using MPI and static partitioning

The vast majority of the code used in the static parallelizations of tree search using Pthreads and OpenMP is taken straight from the second implementation of serial, iterative tree search. In fact, the only differences are in starting the threads, the ini-tial partitioning of the tree, and the Update best tour function. We might therefore expect that an MPI implementation would also require relatively few changes to the serial code, and this is, in fact, the case.

There is the usual problem of distributing the input data and collecting the results. In order to construct a complete tour, a process will need to choose an edge into each vertex and out of each vertex. Thus, each tour will require an entry from each row and each column for each city that’s added to the tour, so it would clearly be advantageous for each process to have access to the entire adjacency matrix. Note that the adjacency matrix is going to be relatively small. For example, even if we have 100 cities, it’s unlikely that the matrix will require more than 80,000 bytes of storage, so it makes sense to simply read in the matrix on process 0 and broadcast it to all the processes.

Once the processes have copies of the adjacency matrix, the bulk of the tree search can proceed as it did in the Pthreads and OpenMP implementations. The principal differences lie in

._. partitioning the tree,

_. checking and updating the best tour, and

after the search has terminated, making sure that process 0 has a copy of the best tour for output.

We’ll discuss each of these in turn.

Partitioning the tree

In the Pthreads and OpenMP implementations, thread 0 uses breadth-first search to search the tree until there are at least thread_count partial tours. Each thread then determines which of these initial partial tours it should get and pushes its tours onto its local stack. Certainly MPI process 0 can also generate a list of comm_sz partial tours. However, since memory isn’t shared, it will need to send the initial partial tours to the appropriate process. We could do this using a loop of sends, but distributing the initial partial tours looks an awful lot like a call to MPI_Scatter. In fact, the only reason we can’t use MPI Scatter is that the number of initial partial tours may not be evenly divisible by comm_sz. When this happens, process 0 won’t be sending the same number of tours to each process, and MPI_Scatter requires that the source of the scatter send the same number of objects to each process in the communicator.

Fortunately, there is a variant of MPI_Scatter, MPI_Scatterv, which can be used to send different numbers of objects to different processes. First recall the syntax of MPI_Scatter:

int MPI_Scatter(

void sendbuf /* in */,

int sendcount /* in */,

MPI_Datatype sendtype /* in */,

void recvbuf /* out */,

int recvcount /* in */,

MPI_Datatype recvtype /* in */,

int root /* in */,

MPI_Comm comm /* in */);

Process root sends sendcount objects of type sendtype from sendbuf to each pro-cess in comm. Each process in comm receives recvcount objects of type recvtype into recvbuf. Most of the time, sendtype and recvtype are the same and sendcount and recvcount are also the same. In any case, it’s clear that the root process must send the same number of objects to each process.

MPI_Scatterv, on the other hand, has syntax

int MPI_Scatterv(

void sendbuf /* in */,

int sendcounts /* in */,

int displacements /* in */,

MPI_Datatype sendtype /* in */,

void recvbuf /* out */,

int recvcount /* in */,

MPI_Datatype recvtype /* in */,

int root /* in */,

MPI_Comm comm /* in */);

The single sendcount argument in a call to MPI_Scatter is replaced by two array arguments: sendcounts and displacements. Both of these arrays contain comm sz elements: sendcounts[q] is the number of objects of type sendtype being sent to process q. Furthermore, displacements[q] specifies the start of the block that is being sent to process q. The displacement is calculated in units of type sendtype. So, for example, if sendtype is MPI_INT, and sendbuf has type int* , then the data that is sent to process q will begin in location

sendbuf + displacements[q]

In general, displacements[q] specifies the offset into sendbuf of the data that will go to process q. The “units” are measured in blocks with extent equal to the extent of sendtype.

Similarly, MPI_Gatherv generalizes MPI_Gather:

int MPI_Gatherv(

void* sendbuf /* in */,

int sendcount /* in */,

MPI_Datatype sendtype /* in */,

void* recvbuf /* out */,

int* recvcounts /* in */,

int* displacements /* in */,

MPI_Datatype recvtype /* in */,

int root /* in */,

MPI_Comm comm /* in */);

Maintaining the best tour

As we observed in our earlier discussion of parallelizing tree search, having each process use its own best tour is likely to result in a lot of wasted computation since the best tour on one process may be much more costly than most of the tours on another process (see Exercise 6.21). Therefore, when a process finds a new best tour, it should send it to the other processes.

First note that when a process finds a new best tour, it really only needs to send its cost to the other processes. Each process only makes use of the cost of the current best tour when it calls Best tour. Also, when a process updates the best tour, it doesn’t care what the actual cities on the former best tour were; it only cares that the cost of the former best tour is greater than the cost of the new best tour.

During the tree search, when one process wants to communicate a new best cost to the other processes, it’s important to recognize that we can’t use MPI Bcast, for recall that MPI Bcast is blocking and every process in the communicator must call MPI Bcast. However, in parallel tree search the only process that will know that a broadcast should be executed is the process that has found a new best cost. If it tries to use MPI Bcast, it will probably block in the call and never return, since it will be the only process that calls it. We therefore need to arrange that the new tour is sent in such a way that the sending process won’t block indefinitely.

MPI provides several options. The simplest is to have the process that finds a new best cost use MPI Send to send it to all the other processes:

for (dest = 0; dest < comm._sz; dest++) if (dest != my_rank)

The destination processes can periodically check for the arrival of new best tour costs. We can’t use MPI Recv to check for messages since it’s blocking; if a process calls

MPI_Recv(&received_cost, 1, MPI_INT, MPI_ANY_SOURCE, NEW_COST_TAG, comm, &status);

MPI_Send(&new_best_cost, 1, MPI_INT, dest, NEW_COST_TAG, comm);

Here, we’re using a special tag defined in our program, NEW COST TAG. This will tell the receiving process that the message is a new cost–as opposed to some other type of message–for example, a tour.

the process will block until a matching message arrives. If no message arrives—for example, if no process finds a new best cost—the process will hang. Fortunately, MPI provides a function that only checks to see if a message is available; it doesn’t actually try to receive a message. It’s called MPI Iprobe, and its syntax is

int MPI_Iprobe(

int source /* in */,

int tag /* in */,

MPI_Comm comm /* in */,

int msg_avail_p /* out */,

MPI_Status status_p /* out */);

It checks to see if a message from process rank source in communicator comm and with tag tag is available. If such a message is available, *msg_avail_p will be assigned the value true and the members of *status_p will be assigned the appro-priate values. For example, status p->MPI_SOURCE will be assigned the rank of the source of the message that’s been received. If no message is available, *msg_avail_p will be assigned the value false. The source and tag arguments can be the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG, respectively. So, to check for a message with a new cost from any process, we can call

MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);

Program 6.9: MPI code to check for new best tour costs

MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);

while (msg_avail) {

MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE, NEW COST TAG, comm, MPI_STATUS_IGNORE);

if (received_cost < best_tour_cost) best_tour_cost = received_cost;

MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);

/* while */

If msg_avail is true, then we can receive the new cost with a call to MPI_Recv:

MPI_Recv(&received cost, 1, MPI_INT, status.MPI_SOURCE, NEW_COST_TAG, comm, MPI_STATUS_IGNORE);

A natural place to do this is in the Best tour function. Before checking whether our new tour is the best tour, we can check for new tour costs from other processes with the code in Program 6.9.

This code will continue to receive messages with new costs as long as they’re available. Each time a new cost is received that’s better than the current best cost, the variable best_tour_cost will be updated.

Did you spot the potential problem with this scheme? If there is no buffering available for the sender, then the loop of calls to MPI_Send can cause the send-ing process to block until a matching receive is posted. If all the other processes have completed their searches, the sending process will hang. The loop of calls to MPI_Send is therefore unsafe.

There are a couple of alternatives provided by MPI: buffered sends and non-blocking sends. We’ll discuss buffered sends here. See Exercise 6.22 for a discussion of nonblocking operations in MPI.

Modes and Buffered Sends

MPI provides four modes for sends: standard, synchronous, ready, and buffered. The various modes specify different semantics for the sending functions. The send that we first learned about, MPI_Send, is the standard mode send. With it, the MPI implementation can decide whether to copy the contents of the message into its own storage or to block until a matching receive is posted. Recall that in synchronous mode, the send will block until a matching receive is posted. In ready mode, the send is erroneous unless a matching receive is posted before the send is started. In buffered mode, the MPI implementation must copy the message into local temporary storage if a matching receive hasn’t been posted. The local temporary storage must be provided by the user program, not the MPI implementation.

Each mode has a different function: MPI Send, MPI Ssend, MPI Rsend, and MPI Bsend, respectively, but the argument lists are identical to the argument lists for MPI Send:

int MPI_Xsend( /* in */,

void message

int message size /* in +/,

MPI_Datatype message_type /* in */,

int dest /* in */,

int tag /* in */,

MPI_Comm comm /* in */);

The buffer that’s used by MPI_Bsend must be turned over to the MPI implementation with a call to MPI_Buffer_attach:

int MPI_Buffer_attach(

void buffer /* in */,

int buffer_size /* in */);

The buffer argument is a pointer to a block of memory allocated by the user program and buffer size is its size in bytes. A previously “attached” buffer can be reclaimed by the program with a call to

int MPI_Buffer_detach(

void buf p /* out */,

int buf_size p /* out */);

The *buf_p argument returns the address of the block of memory that was previously attached, and *buf_size_p gives its size in bytes. A call to MPI_Buffer_detach will block until all messages that have been stored in the buffer are transmitted. Note that since buf_p is an output argument, it should probably be passed in with the ampersand operator. For example:

char buffer[1000];

char buf;

int buf size;

. . .

MPI_Buffer_attach(buffer, 1000);

. . .

/* Calls to MPI Bsend */

. . .

MPI_Buffer_detach(&buf, &buf_size);

At any point in the program only one user-provided buffer can be attached, so if there may be multiple buffered sends that haven’t been completed, we need to estimate the amount of data that will be buffered. Of course, we can’t know this with any certainty, but we do know that in any “broadcast” of a best tour, the process doing the broadcast will make comm sz 1 calls to MPI Bsend, and each of these calls will send a single int. We can thus determine the size of the buffer needed for a single broadcast. The amount of storage that’s needed for the data that’s transmitted can be determined with a call to MPI Pack size:

int MPI_Pack_size(

int count /* in */,

MPI_Datatype datatype /* in */,

MPI_Comm comm /* in */,

int size_p /* out */);

The output argument gives an upper bound on the number of bytes needed to store the data in a message. This won’t be enough, however. Recall that in addition to the data, a message stores information such as the destination, the tag, and the communi-cator, so for each message there is some additional overhead. An upper bound on this additional overhead is given by the MPI constant MPI BSEND OVERHEAD. For a single broadcast, the following code determines the amount of storage needed:

int data_size;

int message_size;

int bcast buf_size;

MPI_Pack_size(1, MPI_INT, comm, &data_size);

message_size = data_size + MPI_BSEND_OVERHEAD;

bcast_buf_size = (comm_sz - 1)* message_size;

We should guess a generous upper bound on the number of broadcasts and multiply that by bcast_buf_size to get the size of the buffer to attach.

Printing the best tour

When the program finishes, we’ll want to print out the actual tour as well as its cost, so we do need to get the tour to process 0. It might at first seem that we could arrange this by having each process store its local best tour—the best tour that it finds—and when the tree search has completed, each process can check its local best tour cost and compare it to the global best tour cost. If they’re the same, the process could send its local best tour to process 0. There are, however, several problems with this. First, it’s entirely possible that there are multiple “best” tours in the TSP digraph, tours that all have the same cost, and different processes may find these different tours. If this happens, multiple processes will try to send their best tours to process 0, and all but one of the threads could hang in a call to MPI Send. A second problem is that it’s possible that one or more processes never received the best tour cost, and they may try to send a tour that isn’t optimal.

We can avoid these problems by having each process store its local best tour, but after all the processes have completed their searches, they can all call MPI Allreduce and the process with the global best tour can then send it to process 0 for output. The following pseudocode provides some details:

struct {

int cost;

int rank;

} loc_data, global_data;

loc_data_cost = Tour_cost(loc_best tour);

loc_data_rank = my_rank;

MPI_Allreduce(&loc_data, &global_data, 1, MPI_2INT, MPI_MINLOC, comm);

if (global data.rank == 0) return;

/* 0 already has the best tour */

if (my_rank == 0)

Receive best tour from process global data.rank;

else if (my_rank == global data.rank)

Send best tour to process 0;

The key here is the operation we use in the call to MPI Allreduce. If we just used MPI MIN, we would know what the cost of the global best tour was, but we wouldn’t know who owned it. However, MPI provides a predefined operator, MPI MINLOC, which operates on pairs of values. The first value is the value to be minimized—in our setting, the cost of the tour—and the second value is the location of the minimum—in our setting, the rank of the process that actually owns the best tour. If more than one process owns a tour with minimum cost, the location will be the lowest of the ranks of the processes that own a minimum cost tour. The input and the output buffers in the call to MPI Allreduce are two-member structs. Since both the cost and the rank are ints, both members are ints. Note that MPI also provides a predefined type MPI 2INT for this type. When the call to MPI Allreduce returns, we have two alternatives:

.. If process 0 already has the best tour, we simply return. Otherwise, the process owning the best tour sends it to process 0.

Unreceived messages

As we noted in the preceding discussion, it is possible that some messages won’t be received during the execution of the parallel tree search. A process may finish searching its subtree before some other process has found a best tour. This won’t cause the program to print an incorrect result; the call to MPI_Allreduce that finds the process with the best tour won’t return until every process has called it, and some process will have the best tour. Thus, it will return with the correct least-cost tour, and process 0 will receive this tour.

However, unreceived messages can cause problems with the call to MPI_Buffer_detach or the call to MPI_Finalize. A process can hang in one of these calls if it is storing buffered messages that were never received, so before we attempt to shut down MPI, we can try to receive any outstanding messages by using MPI_Iprobe. The code is very similar to the code we used to check for new best tour costs. See Program 6.9. In fact, the only messages that are not sent in collectives are the “best tour” message sent to process 0, and the best tour cost broadcasts. The MPI collectives will hang if some process doesn’t participate, so we only need to look for unreceived best tours.

In the dynamically load-balanced code (which we’ll discuss shortly) there are other messages, including some that are potentially quite large. To handle this situa-tion, we can use the status argument returned by MPI_Iprobe to determine the size of the message and allocate additional storage as necessary (see Exercise 6.23).

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

An Introduction to Parallel Programming : Parallel Program Development : Implementation of tree search using MPI and static partitioning |