Chapter: An Introduction to Parallel Programming - Shared-Memory Programming with Pthreads

| Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail |

Busy-Waiting

When, say, thread 0 wants to execute the statement x = x + y, it needs to first make sure that thread 1 is not already executing the statement. Once thread 0 makes sure of this, it needs to provide some way for thread 1 to determine that it, thread 0, is executing the statement, so that thread 1 won’t attempt to start executing the statement until thread 0 is done. Finally, after thread 0 has completed execution of the statement, it needs to provide some way for thread 1 to determine that it is done, so that thread 1 can safely start executing the statement.

 

A simple approach that doesn’t involve any new concepts is the use of a flag variable. Suppose flag is a shared int that is set to 0 by the main thread. Further, suppose we add the following code to our example:

 

          y = Compute(my_rank);
          while (flag != my_rank);
          x = x + y;
          flag++;
 

Let’s suppose that thread 1 finishes the assignment in Line 1 before thread 0. What happens when it reaches the while statement in Line 2? If you look at the while statement for a minute, you’ll see that it has the somewhat peculiar property that its body is empty. So if the test flag != my_rank is true, then thread 1 will just execute the test a second time. In fact, it will keep re-executing the test until the test is false. When the test is false, thread 1 will go on to execute the code in the critical section, x = x + y.

 

Since we’re assuming that the main thread has initialized flag to 0, thread 1 won’t proceed to the critical section in Line 3 until thread 0 executes the statement flag++. In fact, we see that unless some catastrophe befalls thread 0, it will eventually catch up to thread 1. However, when thread 0 executes its first test of flag != my_rank, the condition is false, and it will go on to execute the code in the critical section, x = x + y. When it’s done with this, we see that it will execute flag++, and thread 1 can finally enter the critical section.

The key here is that thread 1 cannot enter the critical section until thread 0 has completed the execution of flag++. And, provided the statements are executed exactly as they’re written, this means that thread 1 cannot enter the critical section until thread 0 has completed it.

 

The while loop is an example of busy-waiting. In busy-waiting, a thread repeatedly tests a condition, but, effectively, does no useful work until the condition has the appropriate value (false in our example).

 

Note that we said that the busy-wait solution would work “provided the statements are executed exactly as they’re written.” If compiler optimization is turned on, it is possible that the compiler will make changes that will affect the correctness of busy-waiting. The reason for this is that the compiler is unaware that the program is multithreaded, so it doesn’t “know” that the variables x and flag can be modified by another thread. For example, if our code

 

          y = Compute(my_rank);
          while (flag != my_rank);
          x = x + y;
          flag++;

 

is run by just one thread, the order of the statements while (flag != my_rank) and x = x + y is unimportant. An optimizing compiler might therefore determine that the program would make better use of registers if the order of the statements were switched. Of course, this will result in the code

          y = Compute(my_rank);
          x = x + y;
          while (flag != my_rank);
          flag++;

which defeats the purpose of the busy-wait loop. The simplest solution to this problem is to turn compiler optimizations off when we use busy-waiting. For an alternative to completely turning off optimizations, see Exercise 4.3.

 

We can immediately see that busy-waiting is not an ideal solution to the problem of controlling access to a critical section. Since thread 1 will execute the test over and over until thread 0 executes flag++, if thread 0 is delayed (for example, if the operating system preempts it to run something else), thread 1 will simply “spin” on the test, eating up CPU cycles. This can be positively disastrous for performance. Turning off compiler optimizations can also seriously degrade performance.

 

Before going on, though, let’s return to our calculation program in Figure 4.3 and correct it by using busy-waiting. The critical section in this function is Line 15. We can therefore precede this with a busy-wait loop. However, when a thread is done with the critical section, if it simply increments flag, eventually flag will be greater than t, the number of threads, and none of the threads will be able to return to the critical section. That is, after executing the critical section once, all the threads will be stuck forever in the busy-wait loop. Thus, in this instance, we don’t want to simply increment flag. Rather, the last thread, thread t − 1, should reset flag to zero. This can be accomplished by replacing flag++ with

flag = (flag + 1) % thread_count;

With this change, we get the thread function shown in Program 4.4. If we compile the program and run it with two threads, we see that it is computing the correct results. However, if we add in code for computing elapsed time, we see that when n = 10^8, the serial sum is consistently faster than the parallel sum. For example, on the dual-core system, the elapsed time for the sum as computed by two threads is about 19.5 seconds, while the elapsed time for the serial sum is about 2.8 seconds!

Program 4.4: Pthreads global sum with busy-waiting

          void* Thread_sum(void* rank) {
             long my_rank = (long) rank;
             double factor;
             long long i;
             long long my_n = n/thread_count;
             long long my_first_i = my_n*my_rank;
             long long my_last_i = my_first_i + my_n;

             if (my_first_i % 2 == 0)
                factor = 1.0;
             else
                factor = -1.0;

             for (i = my_first_i; i < my_last_i; i++, factor = -factor) {
                while (flag != my_rank);
                sum += factor/(2*i+1);
                flag = (flag+1) % thread_count;
             }

             return NULL;
          }  /*  Thread_sum  */

Why is this? Of course, there’s overhead associated with starting up and joining the threads. However, we can estimate this overhead by writing a Pthreads program in which the thread function simply returns:

 

          void* Thread_function(void* ignore) {
             return NULL;
          }  /*  Thread_function  */

 

When we find the time that’s elapsed between starting the first thread and joining the second thread, we see that on this particular system, the overhead is less than 0.3 milliseconds, so the slowdown isn’t due to thread overhead. If we look closely at the thread function that uses busy-waiting, we see that the threads take turns executing the critical section code, the update to sum inside the for loop. Initially flag is 0, so thread 1 must wait until thread 0 executes the critical section and updates flag. Then, thread 0 must wait until thread 1 executes and updates flag. The threads will alternate between waiting and executing, and evidently the waiting and the flag updates increase the overall run time by a factor of seven.

Program 4.5: Global sum function with critical section after loop

          void* Thread_sum(void* rank) {
             long my_rank = (long) rank;
             double factor, my_sum = 0.0;
             long long i;
             long long my_n = n/thread_count;
             long long my_first_i = my_n*my_rank;
             long long my_last_i = my_first_i + my_n;

             if (my_first_i % 2 == 0)
                factor = 1.0;
             else
                factor = -1.0;

             for (i = my_first_i; i < my_last_i; i++, factor = -factor)
                my_sum += factor/(2*i+1);

             while (flag != my_rank);
             sum += my_sum;
             flag = (flag+1) % thread_count;

             return NULL;
          }  /*  Thread_sum  */

As we’ll see, busy-waiting isn’t the only solution to protecting a critical section. In fact, there are much better solutions. However, since the code in a critical section can only be executed by one thread at a time, no matter how we limit access to the critical section, we’ll effectively serialize the code in the critical section. Therefore, if it’s at all possible, we should minimize the number of times we execute critical section code. One way to greatly improve the performance of the sum function is to have each thread use a private variable to store its total contribution to the sum. Then, each thread can add in its contribution to the global sum once, after the for loop. See Program 4.5. When we run this on the dual-core system with n = 10^8, the elapsed time is reduced to 1.5 seconds for two threads, a substantial improvement.




Copyright © 2018-2020 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.