PARALLEL SOFTWARE
Parallel hardware has arrived. Virtually all desktop and server systems use multicore processors. The same cannot be said for parallel software. Except for operating systems, database systems, and Web servers, there is currently very little commodity software that makes extensive use of parallel hardware. As we noted in Chapter 1, this is a problem because we can no longer rely on hardware and compilers to provide a steady increase in application performance. If we’re to continue to have routine increases in application performance and application power, software developers must learn to write applications that exploit shared- and distributed-memory architectures. In this section we’ll take a look at some of the issues involved in writing software for parallel systems.
First, some terminology. Typically when we run our shared-memory programs, we’ll start a single process and fork multiple threads. So when we discuss shared-memory programs, we’ll talk about threads carrying out tasks. On the other hand, when we run distributed-memory programs, we’ll start multiple processes, and we’ll talk about processes carrying out tasks. When the discussion applies equally well to shared-memory and distributed-memory systems, we’ll talk about processes/threads carrying out tasks.
Caveats
Before proceeding, we need to stress some of the limitations of this section. First, here, and in the remainder of the book, we’ll only be discussing software for MIMD systems. For example, while the use of GPUs as a platform for parallel computing continues to grow at a rapid pace, the application programming interfaces (APIs) for GPUs are necessarily very different from standard MIMD APIs. Second, we stress that our coverage is only meant to give some idea of the issues: there is no attempt to be comprehensive.
Finally, we’ll mainly focus on what’s often called single program, multiple data, or SPMD, programs. Instead of running a different program on each core, SPMD programs consist of a single executable that can behave as if it were multiple different programs through the use of conditional branches. For example,

if (I'm thread/process 0)
   do this;
else
   do that;
Observe that SPMD programs can readily implement data-parallelism. For example,

if (I'm thread/process 0)
   operate on the first half of the array;
else   /* I'm thread/process 1 */
   operate on the second half of the array;
Recall that a program is task parallel if it obtains its parallelism by dividing tasks among the threads or processes. The first example makes it clear that SPMD programs can also implement task-parallelism.
Coordinating the processes/threads
In a very few cases, obtaining excellent parallel performance is trivial. For example, suppose we have two arrays and we want to add them:

double x[n], y[n];
. . .
for (int i = 0; i < n; i++)
   x[i] += y[i];
In order to parallelize this, we only need to assign elements of the arrays to the processes/threads. For example, if we have p processes/threads, we might make process/thread 0 responsible for elements 0, 1, . . . , n/p − 1, process/thread 1 would be responsible for elements n/p, n/p + 1, . . . , 2n/p − 1, and so on.
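As a concrete illustration, here is a minimal sketch of this kind of block partitioning. It assumes, for simplicity, that p evenly divides n and that each process/thread already knows its rank (my_rank), the number of processes/threads (p), and the arrays x and y:

int my_n     = n / p;                /* number of elements per process/thread */
int my_first = my_rank * my_n;       /* first element assigned to me          */
int my_last  = my_first + my_n - 1;  /* last element assigned to me           */

for (int i = my_first; i <= my_last; i++)
   x[i] += y[i];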
So for this example, the programmer only needs to

1. Divide the work among the processes/threads
   (a) in such a way that each process/thread gets roughly the same amount of work, and
   (b) in such a way that the amount of communication required is minimized.
Recall that the process of dividing the work among the processes/threads so that (a) is satisfied is called load balancing. The two qualifications on dividing the work are obvious, but nonetheless important. In many cases it won’t be necessary to give much thought to them; they typically become concerns in situations in which the amount of work isn’t known in advance by the programmer, but rather the work is generated as the program runs. For an example, see the tree search problem in Chapter 6.
Although we might wish for a term that’s a little easier to pronounce, recall that the process of converting a serial program or algorithm into a parallel program is often called parallelization. Programs that can be parallelized by simply dividing the work among the processes/threads are sometimes said to be embarrassingly parallel. This is a bit unfortunate, since it suggests that programmers should be embarrassed to have written an embarrassingly parallel program, when, to the contrary, successfully devising a parallel solution to any problem is a cause for great rejoicing.
Alas, the vast majority of problems are much more determined to resist our efforts to find a parallel solution. As we saw in Chapter 1, for these problems, we need to coordinate the work of the processes/threads. In these programs, we also usually need to

2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among the processes/threads.

These last two problems are often interrelated. For example, in distributed-memory programs, we often implicitly synchronize the processes by communicating among them, and in shared-memory programs, we often communicate among the threads by synchronizing them. We’ll say more about both issues below.
Shared-memory
As we noted earlier, in shared-memory programs, variables can be shared or private. Shared variables can be read or written by any thread, and private variables can ordinarily only be accessed by one thread. Communication among the threads is usually done through shared variables, so communication is implicit, rather than explicit.
Dynamic and static threads
In many environments shared-memory programs use dynamic threads. In this paradigm, there is often a master thread and at any given instant a (possibly empty) collection of worker threads. The master thread typically waits for work requests—for example, over a network—and when a new request arrives, it forks a worker thread, the thread carries out the request, and when the thread completes the work, it terminates and joins the master thread. This paradigm makes efficient use of system resources since the resources required by a thread are only being used while the thread is actually running.
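As a rough illustration, here is a minimal sketch of the dynamic paradigm written with Pthreads (which we'll meet in Chapter 4). To keep the sketch self-contained, the "requests" are simulated by a simple loop rather than arriving over a network, and each worker just prints a message and sleeps in place of doing real work. (It would be built with something like gcc -pthread.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

/* A worker thread handles one request and then terminates */
void* Worker(void* arg) {
   int request = *((int*) arg);
   free(arg);
   printf("Worker handling request %d\n", request);
   sleep(1);          /* stand-in for the actual work */
   return NULL;
}

int main(void) {
   pthread_attr_t attr;
   pthread_attr_init(&attr);
   /* Detached threads release their resources as soon as they finish,
      so the master never joins them */
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

   /* The master "waits for requests"; here they're just simulated */
   for (int request = 0; request < 5; request++) {
      int* arg = malloc(sizeof(int));
      *arg = request;
      pthread_t worker;
      pthread_create(&worker, &attr, Worker, arg);
   }

   pthread_attr_destroy(&attr);
   sleep(2);          /* crude: give the detached workers time to finish */
   return 0;
}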
An alternative to the dynamic paradigm is the static thread paradigm. In this paradigm, all of the threads are forked after any needed setup by the master thread and the threads run until all the work is completed. After the threads join the master thread, the master thread may do some cleanup (e.g., free memory) and then it also terminates. In terms of resource usage, this may be less efficient: if a thread is idle, its resources (e.g., stack, program counter, and so on) can’t be freed. However, forking and joining threads can be fairly time-consuming operations. So if the necessary resources are available, the static thread paradigm has the potential for better performance than the dynamic paradigm. It also has the virtue that it’s closer to the most widely used paradigm for distributed-memory programming, so part of the mindset that is used for one type of system is preserved for the other. Hence, we’ll often use the static thread paradigm.
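By way of contrast, a minimal Pthreads sketch of the static paradigm might look like the following. All of the threads are forked up front, each carries out its share of the work (here just a print statement), and the master joins them before cleaning up:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* Thread_work(void* rank) {
   long my_rank = (long) rank;
   /* ... this thread's share of the work would go here ... */
   printf("Thread %ld done\n", my_rank);
   return NULL;
}

int main(int argc, char* argv[]) {
   long thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 4;
   pthread_t* handles = malloc(thread_count * sizeof(pthread_t));

   /* Fork all the threads after any needed setup */
   for (long thread = 0; thread < thread_count; thread++)
      pthread_create(&handles[thread], NULL, Thread_work, (void*) thread);

   /* Wait for every thread to finish, then clean up */
   for (long thread = 0; thread < thread_count; thread++)
      pthread_join(handles[thread], NULL);

   free(handles);
   return 0;
}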
Nondeterminism
In any MIMD system in which the processors execute asynchronously it is likely that there will be nondeterminism. A computation is nondeterministic if a given input can result in different outputs. If multiple threads are executing independently, the relative rate at which they’ll complete statements varies from run to run, and hence the results of the program may be different from run to run. As a very simple example, suppose we have two threads, one with id or rank 0 and the other with id or rank 1. Suppose also that each is storing a private variable my_x, thread 0’s value for my_x is 7, and thread 1’s is 19. Further, suppose both threads execute the following code:

. . .
printf("Thread %d > my_val = %d\n", my_rank, my_x);
. . .
Then the output could be

Thread 0 > my_val = 7
Thread 1 > my_val = 19

but it could also be

Thread 1 > my_val = 19
Thread 0 > my_val = 7

In fact, things could be even worse: the output of one thread could be broken up by the output of the other thread. However, the point here is that because the threads are executing independently and interacting with the operating system, the time it takes for one thread to complete a block of statements varies from execution to execution, so the order in which these statements complete can’t be predicted.
In many cases nondeterminism isn’t a problem. In our example, since we’ve labelled the output with the thread’s rank, the order in which the output appears probably doesn’t matter. However, there are also many cases in which nondeterminism—especially in shared-memory programs—can be disastrous, because it can easily result in program errors. Here’s a simple example with two threads. Suppose each thread computes an int, which it stores in a private variable my_val. Suppose also that we want to add the values stored in my_val into a shared-memory location x that has been initialized to 0. Both threads therefore want to execute code that looks something like this:

my_val = Compute_val(my_rank);
x += my_val;
Now recall that an addition typically requires loading the two values to be added into registers, adding the values, and finally storing the result. To keep things relatively simple, we’ll assume that values are loaded from main memory directly into registers and stored in main memory directly from registers. Here is one possible sequence of events:

Time   Thread 0                            Thread 1
1      load x = 0 into a register          finish assignment to my_val
2      load my_val into a register         load x = 0 into a register
3      add my_val to x in the register     load my_val into a register
4      store x (= thread 0's my_val)       add my_val to x in the register
5      start other work                    store x (= thread 1's my_val)
Clearly this is not what we want, and it’s easy to imagine other sequences of events that result in an incorrect value for x. The nondeterminism here is a result of the fact that two threads are attempting to more or less simultaneously update the memory location x. When threads or processes attempt to simultaneously access a resource, and the accesses can result in an error, we often say the program has a race condition, because the threads or processes are in a “horse race.” That is, the outcome of the computation depends on which thread wins the race. In our example, the threads are in a race to execute x += my_val. In this case, unless one thread completes x += my_val before the other thread starts, the result will be incorrect. A block of code that can only be executed by one thread at a time is called a critical section, and it’s usually our job as programmers to insure mutually exclusive access to the critical section. In other words, we need to insure that if one thread is executing the code in the critical section, then the other threads are excluded.
The most commonly used mechanism for insuring mutual exclusion is a mutual exclusion lock or mutex or lock. A mutex is a special type of object that has support in the underlying hardware. The basic idea is that each critical section is protected by a lock. Before a thread can execute the code in the critical section, it must “obtain” the mutex by calling a mutex function, and, when it’s done executing the code in the critical section, it should “relinquish” the mutex by calling an unlock function. While one thread “owns” the lock—that is, has returned from a call to the lock function, but hasn’t yet called the unlock function—any other thread attempting to execute the code in the critical section will wait in its call to the lock function.
Thus, in order to insure that our code functions correctly, we might modify it so that it looks something like this:

my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
This insures that only one thread at a time can execute the statement x += my_val. Note that the code does not impose any predetermined order on the threads. Either thread 0 or thread 1 can execute x += my_val first. Also note that the use of a mutex enforces serialization of the critical section. Since only one thread at a time can execute the code in the critical section, this code is effectively serial. Thus, we want our code to have as few critical sections as possible, and we want our critical sections to be as short as possible.
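With an actual API such as Pthreads (Chapter 4), this might look like the following sketch. Here Add_my_val is a hypothetical helper that each thread calls with its own my_val, and the lock and the shared sum are global so that all the threads can see them:

#include <pthread.h>

int x = 0;                                             /* shared, initialized to 0 */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void Add_my_val(int my_val) {
   pthread_mutex_lock(&add_my_val_lock);     /* "obtain" the mutex      */
   x += my_val;                              /* critical section        */
   pthread_mutex_unlock(&add_my_val_lock);   /* "relinquish" the mutex  */
}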
There are alternatives to mutexes. In busy-waiting, a thread enters a loop whose sole purpose is to test a condition. In our example, suppose there is a shared variable ok_for_1 that has been initialized to false. Then something like the following code can insure that thread 1 won’t update x until after thread 0 has updated it:

my_val = Compute_val(my_rank);
if (my_rank == 1)
   while (!ok_for_1);   /* Busy-wait loop */
x += my_val;            /* Critical section */
if (my_rank == 0)       /* Let thread 1 update x */
   ok_for_1 = true;
So until thread 0 executes ok_for_1 = true, thread 1 will be stuck in the loop while (!ok_for_1). This loop is called a “busy-wait” because the thread can be very busy waiting for the condition. This has the virtue that it’s simple to understand and implement. However, it can be very wasteful of system resources, because even when a thread is doing no useful work, the core running the thread will be repeatedly checking to see if the critical section can be entered. Semaphores are similar to mutexes, although the details of their behavior are slightly different, and there are some types of thread synchronization that are easier to implement with semaphores than mutexes. A monitor provides mutual exclusion at a somewhat higher level: it is an object whose methods can only be executed by one thread at a time. We’ll discuss busy-waiting and semaphores in Chapter 4.
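For instance, the ordering that the busy-wait enforced above could be obtained with a POSIX semaphore without keeping a core spinning. The sketch below assumes the main thread has initialized the semaphore to 0 with sem_init(&ok_for_1, 0, 0) before forking the two threads, and that x is shared:

#include <semaphore.h>

sem_t ok_for_1;   /* shared; initialized to 0 before the threads are forked */
int   x = 0;      /* shared */

void Thread_work(int my_rank, int my_val) {
   if (my_rank == 1)
      sem_wait(&ok_for_1);    /* block (without spinning) until thread 0 posts */
   x += my_val;               /* critical section */
   if (my_rank == 0)
      sem_post(&ok_for_1);    /* let thread 1 update x */
}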
There are a number of other alternatives that are currently being studied but that are not yet widely available. The one that has attracted the most attention is probably transactional memory [31]. In database management systems, a transaction is an access to a database that the system treats as a single unit. For example, transferring $1000 from your savings account to your checking account should be treated by your bank’s software as a transaction, so that the software can’t debit your savings account without also crediting your checking account. If the software was able to debit your savings account, but was then unable to credit your checking account, it would roll back the transaction. In other words, the transaction would either be fully completed or any partial changes would be erased. The basic idea behind transactional memory is that critical sections in shared-memory programs should be treated as transactions. Either a thread successfully completes the critical section or any partial results are rolled back and the critical section is repeated.
Thread safety
In many, if not most, cases parallel programs can call functions developed for use in serial programs, and there won’t be any problems. However, there are some notable exceptions. The most important exception for C programmers occurs in functions that make use of static local variables. Recall that ordinary C local variables—variables declared inside a function—are allocated from the system stack. Since each thread has its own stack, ordinary C local variables are private. However, recall that a static variable that’s declared in a function persists from one call to the next. Thus, static variables are effectively shared among any threads that call the function, and this can have unexpected and unwanted consequences.
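As a small, hypothetical illustration of the problem, consider a function that hands out identifiers using a static counter. Every thread that calls it shares the one counter, so two threads can read and increment it at the same time and receive the same "unique" id:

/* Not thread safe: the static counter is shared by all calling threads */
int Next_id(void) {
   static int counter = 0;   /* one copy for the whole program, not one per thread */
   counter++;                /* unprotected read-modify-write: a race condition    */
   return counter;
}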
For example, the C string library function strtok splits an input string into substrings. When it’s first called, it’s passed a string, and on subsequent calls it returns successive substrings. This can be arranged through the use of a static char* variable that refers to the string that was passed on the first call. Now suppose two threads are splitting strings into substrings. Clearly, if, for example, thread 0 makes its first call to strtok, and then thread 1 makes its first call to strtok before thread 0 has completed splitting its string, then thread 0’s string will be lost or overwritten, and, on subsequent calls, it may get substrings of thread 1’s strings.
A function such as strtok is not thread safe. This means that if it is used in a multithreaded program, there may be errors or unexpected results. When a block of code isn’t thread safe, it’s usually because different threads are accessing shared data. Thus, as we’ve seen, even though many serial functions can be used safely in multithreaded programs—that is, they’re thread safe—programmers need to be wary of functions that were written exclusively for use in serial programs. We’ll take a closer look at thread safety in Chapters 4 and 5.
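POSIX provides a reentrant variant, strtok_r, in which the caller supplies the pointer that records the position in the string, so each thread can keep its own. A minimal sketch of its use:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int main(void) {
   char  line[] = "a few words to split";
   char* saveptr;   /* per-caller state instead of a hidden static variable */

   for (char* token = strtok_r(line, " ", &saveptr);
        token != NULL;
        token = strtok_r(NULL, " ", &saveptr))
      printf("token: %s\n", token);

   return 0;
}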
Distributed-memory
In distributed-memory programs, the cores can directly access only their own, private memories. There are several APIs that are used. However, by far the most widely used is message-passing. So we’ll devote most of our attention in this section to message-passing. Then we’ll take a brief look at a couple of other, less widely used, APIs.
Perhaps the first thing to note regarding distributed-memory APIs is that they can be used with shared-memory hardware. It’s perfectly feasible for programmers to logically partition shared memory into private address spaces for the various threads, and a library or compiler can implement the communication that’s needed.
As we noted earlier, distributed-memory programs are usually executed by starting multiple processes rather than multiple threads. This is because typical “threads of execution” in a distributed-memory program may run on independent CPUs with independent operating systems, and there may be no software infrastructure for starting a single “distributed” process and having that process fork one or more threads on each node of the system.
Message-passing
A message-passing API provides (at a minimum) a send and a receive function. Processes typically identify each other by ranks in the range 0, 1, . . . , p − 1, where p is the number of processes. So, for example, process 1 might send a message to process 0 with the following pseudo-code:
char message[100];
. . .
my_rank = Get_rank();
if (my_rank == 1) {
   sprintf(message, "Greetings from process 1");
   Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
   Receive(message, MSG_CHAR, 100, 1);
   printf("Process 0 > Received: %s\n", message);
}
Here the Get_rank function returns the calling process’ rank. Then the processes branch depending on their ranks. Process 1 creates a message with sprintf from the standard C library and then sends it to process 0 with the call to Send. The arguments to the call are, in order, the message, the type of the elements in the message (MSG_CHAR), the number of elements in the message (100), and the rank of the destination process (0). On the other hand, process 0 calls Receive with the following arguments: the variable into which the message will be received (message), the type of the message elements, the number of elements available for storing the message, and the rank of the process sending the message. After completing the call to Receive, process 0 prints the message.
Several points are worth noting here. First note that the program segment is SPMD. The two processes are using the same executable, but carrying out different actions. In this case, what they do depends on their ranks. Second, note that the variable message refers to different blocks of memory on the different processes. Programmers often stress this by using variable names such as my_message or local_message. Finally, note that we’re assuming that process 0 can write to stdout. This is usually the case: most implementations of message-passing APIs allow all processes access to stdout and stderr—even if the API doesn’t explicitly provide for this. We’ll talk a little more about I/O later on.
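For concreteness, here is one way the same exchange might be written with MPI, the message-passing API taken up in Chapter 3. This is only a sketch: it would be compiled with an MPI wrapper such as mpicc and run with two processes, e.g. mpiexec -n 2 ./greeting.

#include <stdio.h>
#include <mpi.h>

int main(void) {
   char message[100];
   int  my_rank;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank == 1) {
      sprintf(message, "Greetings from process 1");
      MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else if (my_rank == 0) {
      MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      printf("Process 0 > Received: %s\n", message);
   }

   MPI_Finalize();
   return 0;
}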
There are several possibilities for the exact behavior of the Send and Receive functions, and most message-passing APIs provide several different send and/or receive functions. The simplest behavior is for the call to Send to block until the call to Receive starts receiving the data. This means that the process calling Send won’t return from the call until the matching call to Receive has started. Alternatively, the Send function may copy the contents of the message into storage that it owns, and then it will return as soon as the data is copied. The most common behavior for the Receive function is for the receiving process to block until the message is received. There are other possibilities for both Send and Receive, and we’ll discuss some of them in Chapter 3.
Typical message-passing APIs also provide a wide variety of additional functions. For example, there may be functions for various “collective” communications, such as a broadcast, in which a single process transmits the same data to all the processes, or a reduction, in which results computed by the individual processes are combined into a single result—for example, values computed by the processes are added. There may also be special functions for managing processes and communicating complicated data structures. The most widely used API for message-passing is the Message-Passing Interface or MPI. We’ll take a closer look at it in Chapter 3.
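As a small sketch of a collective operation, the following MPI fragment adds one value from each process and leaves the total on process 0. The value my_val is just made up from the rank for illustration; in a real program it would be something the process computed:

#include <stdio.h>
#include <mpi.h>

int main(void) {
   int my_rank;
   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   double my_val = my_rank + 1.0;   /* stand-in for a locally computed value */
   double total  = 0.0;

   /* Combine the processes' values by addition; the result lands on process 0 */
   MPI_Reduce(&my_val, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (my_rank == 0)
      printf("Sum of the processes' values = %f\n", total);

   MPI_Finalize();
   return 0;
}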
Message-passing is a very powerful and versatile API for developing parallel programs. Virtually all of the programs that are run on the most powerful computers in the world use message-passing. However, it is also very low level. That is, there is a huge amount of detail that the programmer needs to manage. For example, in order to parallelize a serial program, it is usually necessary to rewrite the vast majority of the program. The data structures in the program may have to either be replicated by each process or be explicitly distributed among the processes. Furthermore, the rewriting usually can’t be done incrementally. For example, if a data structure is used in many parts of the program, distributing it for the parallel parts and collecting it for the serial (unparallelized) parts will probably be prohibitively expensive. Therefore, message-passing is sometimes called “the assembly language of parallel programming,” and there have been many attempts to develop other distributed-memory APIs.
One-sided communication
In message-passing, one process must call a send function and the send must be matched by another process’ call to a receive function. Any communication requires the explicit participation of two processes. In one-sided communication, or remote memory access, a single process calls a function, which updates either local memory with a value from another process or remote memory with a value from the calling process. This can simplify communication, since it only requires the active participation of a single process. Furthermore, it can significantly reduce the cost of communication by eliminating the overhead associated with synchronizing two processes. It can also reduce overhead by eliminating the overhead of one of the function calls (send or receive).
It should be noted that some of these advantages may be hard to realize in practice. For example, if process 0 is copying a value into the memory of process 1, 0 must have some way of knowing when it’s safe to copy, since it will overwrite some memory location. Process 1 must also have some way of knowing when the memory location has been updated. The first problem can be solved by synchronizing the two processes before the copy, and the second problem can be solved by another synchronization or by having a “flag” variable that process 0 sets after it has completed the copy. In the latter case, process 1 may need to poll the flag variable in order to determine that the new value is available. That is, it must repeatedly check the flag variable until it gets the value indicating 0 has completed its copy. Clearly, these problems can considerably increase the overhead associated with transmitting a value. A further difficulty is that since there is no explicit interaction between the two processes, remote memory operations can introduce errors that are very hard to track down.
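To make the idea concrete, here is a rough sketch using MPI's remote memory access functions (MPI provides one-sided communication in addition to the two-sided message-passing we focus on in Chapter 3). Process 0 puts a value directly into a "window" of memory that process 1 has exposed; the MPI_Win_fence calls supply the synchronization discussed above:

#include <stdio.h>
#include <mpi.h>

int main(void) {
   int     my_rank;
   double  buf   = 0.0;    /* each process exposes its copy of buf          */
   double  value = 3.14;   /* the value process 0 will put into process 1's buf */
   MPI_Win win;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   MPI_Win_create(&buf, sizeof(double), sizeof(double),
                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   MPI_Win_fence(0, win);                 /* synchronize before the access */
   if (my_rank == 0)
      /* Write value into process 1's buf; process 1 takes no explicit part */
      MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
   MPI_Win_fence(0, win);                 /* synchronize after the access  */

   if (my_rank == 1)
      printf("Process 1 > buf = %f\n", buf);

   MPI_Win_free(&win);
   MPI_Finalize();
   return 0;
}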
Partitioned global address space languages
Since many programmers find shared-memory programming more appealing than message-passing or one-sided communication, a number of groups are developing parallel programming languages that allow the user to use some shared-memory techniques for programming distributed-memory hardware. This isn’t quite as simple as it sounds. If we simply wrote a compiler that treated the collective memories in a distributed-memory system as a single large memory, our programs would have poor, or, at best, unpredictable performance, since each time a running process accessed memory, it might access local memory—that is, memory belonging to the core on which it was executing—or remote memory, memory belonging to another core. Accessing remote memory can take hundreds or even thousands of times longer than accessing local memory. As an example, consider the following pseudo-code for a shared-memory vector addition:
shared int n = . . . ;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = . . . ;
my_last_element = . . . ;

/* Initialize x and y */
. . .

for (i = my_first_element; i <= my_last_element; i++)
   x[i] += y[i];
We first declare two shared arrays. Then, on the basis of the process’ rank, we determine which elements of the array “belong” to which process. After initializing the arrays, each process adds its assigned elements. If the assigned elements of x and y have been allocated so that the elements assigned to each process are in the memory attached to the core the process is running on, then this code should be very fast. However, if, for example, all of x is assigned to core 0 and all of y is assigned to core 1, then the performance is likely to be terrible, since each time the assignment x[i] += y[i] is executed, the process will need to refer to remote memory.
Partitioned global address space, or PGAS, languages provide some of the mechanisms of shared-memory programs. However, they provide the programmer with tools to avoid the problem we just discussed. Private variables are allocated in the local memory of the core on which the process is executing, and the distribution of the data in shared data structures is controlled by the programmer. So, for example, she knows which elements of a shared array are in which process’ local memory.
There are a number of research projects currently working on the development of PGAS languages. See, for example, [7, 9, 45].
Programming hybrid systems
Before moving on, we should note that it is possible to program systems such as clusters of multicore processors using a combination of a shared-memory API on the nodes and a distributed-memory API for internode communication. However, this is usually only done for programs that require the highest possible levels of performance, since the complexity of this “hybrid” API makes program development extremely difficult. See, for example, [40]. Rather, such systems are usually programmed using a single, distributed-memory API for both inter- and intra-node communication.
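For completeness, here is a rough sketch of what such a hybrid program can look like, using MPI between the nodes and OpenMP (Chapter 5) among the cores within a node. The particular computation (summing a series) is just a placeholder, and it would be built with something like mpicc -fopenmp:

#include <stdio.h>
#include <mpi.h>

int main(void) {
   int provided, my_rank;

   /* Ask MPI for a threading level that lets the master thread make
      MPI calls while OpenMP threads exist */
   MPI_Init_thread(NULL, NULL, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   double local_sum = 0.0;

   /* Shared-memory parallelism within the node */
#pragma omp parallel for reduction(+: local_sum)
   for (int i = 0; i < 1000000; i++)
      local_sum += 1.0 / (i + 1.0);

   /* Distributed-memory communication between the nodes */
   double global_sum = 0.0;
   MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
              MPI_COMM_WORLD);

   if (my_rank == 0)
      printf("Global sum = %f\n", global_sum);

   MPI_Finalize();
   return 0;
}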