Let’s look at another potential problem that occurs in shared-memory programming: thread-safety. A block of code is thread-safe if it can be simultaneously executed by multiple threads without causing problems.
As an example, suppose we want to use multiple threads to “tokenize” a file. Let’s suppose that the file consists of ordinary English text, and that the tokens are just contiguous sequences of characters separated from the rest of the text by white space (spaces, tabs, or newlines). A simple approach to this problem is to divide the input file into lines of text and assign the lines to the threads in a round-robin fashion: with t threads, the first line goes to thread 0, the second goes to thread 1, . . . , the tth goes to thread t − 1, the (t + 1)st goes to thread 0, and so on.
We’ll read the text into an array of strings, with one line of text per string. Then we can use a parallel for directive with a schedule(static,1) clause to divide the lines among the threads.
One way to tokenize a line is to use the strtok function in string.h. It has the following prototype:
char* strtok(
   char* string /* in/out */,
   const char* separators /* in */);
Its usage is a little unusual: the first time it’s called, the string argument should be the text to be tokenized, so in our example it should be the line of input. For subsequent calls, the first argument should be NULL. The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns successive tokens taken from the cached copy. The characters that delimit tokens should be passed in separators, so we should pass in the string " \t\n" as the separators argument.
Program 5.6: A first attempt at a multithreaded tokenizer
Given these assumptions, we can write the Tokenize function shown in Program 5.6. The main function has initialized the array lines so that it contains the input text, and line_count is the number of strings stored in lines. Although for our purposes, we only need the lines argument to be an input argument, the strtok function modifies its input. Thus, when Tokenize returns, lines will be modified. When we run the program with a single thread, it correctly tokenizes the input stream. The first time we run it with two threads and the input
Pease porridge hot.
Pease porridge cold.
Pease porridge in the pot
Nine days old.
the output is also correct. However, the second time we run it with this input, we get the following output.
Thread 0 > line 0 = Pease porridge hot.
Thread 1 > line 1 = Pease porridge cold.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Pease
Thread 0 > token 1 = porridge
Thread 1 > token 1 = cold.
Thread 0 > line 2 = Pease porridge in the pot
Thread 1 > line 3 = Nine days old.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Nine
Thread 0 > token 1 = days
Thread 1 > token 1 = old.
What happened? Recall that strtok caches the input line. It does this by declaring a variable to have static storage class, which causes the value stored in this variable to persist from one call to the next. Unfortunately for us, this cached string is shared, not private. Thus, thread 1’s call to strtok with the second line has apparently overwritten the contents of thread 0’s call with the first line. Even worse, thread 0 has found a token (“days”) that should be in thread 1’s output.
The strtok function is therefore not thread-safe: if multiple threads call it simultaneously, the output it produces may not be correct. Regrettably, it’s not uncommon for C library functions to fail to be thread-safe. For example, neither the random number generator random in stdlib.h nor the time conversion function localtime in time.h is thread-safe. In some cases, a standard specifies an alternate, thread-safe version of a function. In fact, POSIX specifies a thread-safe version of strtok:
char* strtok_r(
   char* string /* in/out */,
   const char* separators /* in */,
   char** saveptr_p /* in/out */);
The “_r” is supposed to suggest that the function is re-entrant, which is sometimes used as a synonym for thread-safe. The first two arguments have the same purpose as the arguments to strtok. The saveptr_p argument is used by strtok_r for keeping track of where the function is in the input string; it serves the purpose of the cached pointer in strtok. We can correct our original Tokenize function by replacing the calls to strtok with calls to strtok_r. We simply need to declare a char* variable to pass in for the third argument, and replace the calls in Line 17 and Line 20 with the calls
my_token = strtok_r(lines[i], " \t\n", &saveptr);
. . .
my_token = strtok_r(NULL, " \t\n", &saveptr);
1. Incorrect programs can produce correct output
Notice that our original version of the tokenizer program shows an especially insidious form of program error: the first time we ran it with two threads, the program produced correct output. It wasn’t until a later run that we saw an error. This, unfortunately, is not a rare occurrence in parallel programs, and it’s especially common in shared-memory programs. Since, for the most part, the threads are running independently of each other, as we noted back at the beginning of the chapter, the exact sequence of statements executed is nondeterministic. For example, we can’t say when thread 1 will first call strtok. If its first call takes place after thread 0 has tokenized its first line, then the tokens identified for the first line should be correct. However, if thread 1 calls strtok before thread 0 has finished tokenizing its first line, it’s entirely possible that thread 0 may not identify all the tokens in the first line. So it’s especially important in developing shared-memory programs to resist the temptation to assume that since a program produces correct output, it must be correct. We always need to be wary of race conditions.