Chapter: An Introduction to Parallel Programming : Shared-Memory Programming with Pthreads

Thread-Safety

Let’s look at another potential problem that occurs in shared-memory programming: thread-safety. A block of code is thread-safe if it can be simultaneously executed by multiple threads without causing problems.


As an example, suppose we want to use multiple threads to “tokenize” a file. Let’s suppose that the file consists of ordinary English text, and that the tokens are just contiguous sequences of characters separated from the rest of the text by white space: a space, a tab, or a newline. A simple approach to this problem is to divide the input file into lines of text and assign the lines to the t threads in a round-robin fashion: the first line goes to thread 0, the second goes to thread 1, . . . , the tth goes to thread t - 1, the (t + 1)st goes to thread 0, and so on.

We can serialize access to the lines of input using semaphores. Then, after a thread has read a single line of input, it can tokenize the line. One way to do this is to use the strtok function in string.h, which has the following prototype:

    char* strtok(
        char*        string      /* in/out */,
        const char*  separators  /* in     */);

Its usage is a little unusual: the first time it’s called, the string argument should be the text to be tokenized, so in our example it should be the line of input. For subsequent calls, the first argument should be NULL. The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns successive tokens taken from the cached string. The characters that delimit tokens should be passed in separators. In our example, we should pass in the string " \t\n" as the separators argument.
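The calling convention just described can be sketched in a few lines of single-threaded C. The helper name count_tokens is hypothetical and is not part of the chapter's program; it only illustrates the "text on the first call, NULL afterwards" pattern.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the chapter): prints and counts the
 * whitespace-separated tokens in line. strtok overwrites delimiters
 * with '\0', so line must be a writable array, not a string literal. */
int count_tokens(char line[]) {
    int count = 0;
    char *token = strtok(line, " \t\n");   /* first call: pass the text */

    while (token != NULL) {
        count++;
        printf("token %d = %s\n", count, token);
        token = strtok(NULL, " \t\n");     /* later calls: pass NULL */
    }
    return count;
}
```

Calling count_tokens on a writable copy of "Pease porridge hot.\n" should report three tokens. Note that the buffer is modified by the call, which is why Program 4.14 prints my_line before tokenizing it.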

Program 4.14: A first attempt at a multithreaded tokenizer

    void* Tokenize(void* rank) {
        long my_rank = (long) rank;
        int count;
        int next = (my_rank + 1) % thread_count;  /* thread that reads after us */
        char* fg_rv;
        char my_line[MAX];
        char* my_string;

        sem_wait(&sems[my_rank]);                 /* wait for our turn to read */
        fg_rv = fgets(my_line, MAX, stdin);
        sem_post(&sems[next]);                    /* let the next thread read */
        while (fg_rv != NULL) {
            printf("Thread %ld > my_line = %s", my_rank, my_line);

            count = 0;
            my_string = strtok(my_line, " \t\n");
            while (my_string != NULL) {
                count++;
                printf("Thread %ld > string %d = %s\n", my_rank, count,
                       my_string);
                my_string = strtok(NULL, " \t\n");
            }

            sem_wait(&sems[my_rank]);
            fg_rv = fgets(my_line, MAX, stdin);
            sem_post(&sems[next]);
        }

        return NULL;
    }  /* Tokenize */

Given these assumptions, we can write the thread function shown in Program 4.14. The main thread has initialized an array of t semaphores, one for each thread. Thread 0’s semaphore is initialized to 1. All the other semaphores are initialized to 0. So the code in Lines 9 to 11 will force the threads to sequentially access the lines of input. Thread 0 will immediately read the first line, but all the other threads will block in sem_wait. When thread 0 executes the sem_post, thread 1 can read a line of input. After each thread has read its first line of input (or end-of-file), any additional input is read in Lines 24 to 26. The fgets function reads a single line of input, and Lines 15 to 22 identify the tokens in the line.

When we run the program with a single thread, it correctly tokenizes the input stream. The first time we run it with two threads and the input

    Pease porridge hot.
    Pease porridge cold.
    Pease porridge in the pot
    Nine days old.

the output is also correct. However, the second time we run it with this input, we get the following output.

    Thread 0 > my_line = Pease porridge hot.
    Thread 0 > string 1 = Pease
    Thread 0 > string 2 = porridge
    Thread 0 > string 3 = hot.
    Thread 1 > my_line = Pease porridge cold.
    Thread 0 > my_line = Pease porridge in the pot
    Thread 0 > string 1 = Pease
    Thread 0 > string 2 = porridge
    Thread 0 > string 3 = in
    Thread 0 > string 4 = the
    Thread 0 > string 5 = pot
    Thread 1 > string 1 = Pease
    Thread 1 > my_line = Nine days old.
    Thread 1 > string 1 = Nine
    Thread 1 > string 2 = days
    Thread 1 > string 3 = old.

What happened? Recall that strtok caches a pointer to the input line. It does this by declaring a variable to have static storage class. This causes the value stored in this variable to persist from one call to the next. Unfortunately for us, this cached pointer is shared, not private. Thus, thread 0’s call to strtok with the third line of the input has apparently overwritten the state cached by thread 1’s call with the second line, and thread 1 found only the first token of its line.
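The effect of the shared cache can be reproduced deterministically in a single thread by interleaving two tokenizations by hand. The following is only a sketch: the helper name next_token_after_interleaving is hypothetical, and the single-threaded interleaving stands in for what two threads might do by accident.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the chapter): starts tokenizing first,
 * then starts tokenizing second before first is finished, and returns
 * whatever the next strtok(NULL, ...) call yields. */
char *next_token_after_interleaving(char first[], char second[]) {
    const char *seps = " \t\n";

    strtok(first, seps);        /* caches a position inside first */
    strtok(second, seps);       /* overwrites the cache: now inside second */
    return strtok(NULL, seps);  /* continues in second, not in first */
}
```

With first holding "Pease porridge cold.\n" and second holding "Nine days old.\n", the returned token is "days" rather than "porridge": the second tokenization has silently replaced the state of the first. In the multithreaded program the same replacement happens, but at an unpredictable moment.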

The strtok function is not thread-safe: if multiple threads call it simultaneously, the output it produces may not be correct. Regrettably, it’s not uncommon for C library functions to fail to be thread-safe. For example, neither the random number generator random in stdlib.h nor the time conversion function localtime in time.h is thread-safe. In some cases, an alternate, thread-safe version of a function is available. In fact, there is a thread-safe version of strtok, which is specified by POSIX:

    char* strtok_r(
        char*        string      /* in/out */,
        const char*  separators  /* in     */,
        char**       saveptr_p   /* in/out */);

The “_r” is supposed to suggest that the function is reentrant, which is sometimes used as a synonym for thread-safe. The first two arguments have the same purpose as the arguments to strtok. The saveptr_p argument is used by strtok_r for keeping track of where the function is in the input string; it serves the purpose of the cached pointer in strtok. We can correct our original Tokenize function by replacing the calls to strtok with calls to strtok_r. We simply need to declare a char* variable, saveptr, to pass in for the third argument, and replace the calls in Line 16 and Line 21 with the calls

    my_string = strtok_r(my_line, " \t\n", &saveptr);
    . . .
    my_string = strtok_r(NULL, " \t\n", &saveptr);

respectively.
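To see why a caller-supplied save pointer helps, here is a minimal single-threaded sketch that interleaves two tokenizations by hand, each with its own save pointer. The helper name next_token_after_interleaving_r and the variables save_first and save_second are hypothetical; the feature-test macro is only needed when compiling in a strict ISO C mode.

```c
#define _POSIX_C_SOURCE 200809L  /* expose strtok_r under strict -std=c99/c11 */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the chapter): starts tokenizing first,
 * then starts tokenizing second, then asks for the next token of first.
 * Each tokenization keeps its position in its own save pointer. */
char *next_token_after_interleaving_r(char first[], char second[]) {
    const char *seps = " \t\n";
    char *save_first;
    char *save_second;

    strtok_r(first, seps, &save_first);        /* position stored in save_first */
    strtok_r(second, seps, &save_second);      /* does not disturb save_first */
    return strtok_r(NULL, seps, &save_first);  /* second token of first */
}
```

With first holding "Pease porridge cold.\n" and second holding "Nine days old.\n", the returned token is "porridge". In the Tokenize function, saveptr is a local variable, so each thread has its own copy on its own stack and the threads can no longer disturb one another's position.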

1. Incorrect programs can produce correct output

Notice that our original version of the tokenizer program shows an especially insidious form of program error: the first time we ran it with two threads, the program produced correct output. It wasn’t until a later run that we saw an error. This, unfortunately, is not a rare occurrence in parallel programs. It’s especially common in shared-memory programs. Since, for the most part, the threads are running independently of each other, as we noted earlier, the exact sequence of statements executed is nondeterministic. For example, we can’t say when thread 1 will first call strtok. If its first call takes place after thread 0 has tokenized its first line, then the tokens identified for the first line should be correct. However, if thread 1 calls strtok before thread 0 has finished tokenizing its first line, it’s entirely possible that thread 0 may not identify all the tokens in the first line. Therefore, it’s especially important in developing shared-memory programs to resist the temptation to assume that since a program produces correct output, it must be correct. We always need to be wary of race conditions.
