## Chapter: An Introduction to Parallel Programming : Shared-Memory Programming with Pthreads

Let’s take a look at the problem of controlling access to a large, shared data struc-ture, which can be either simply searched or updated by the threads. For the sake of explicitness, let’s suppose the shared data structure is a sorted linked list of ints, and the operations of interest are Member, Insert, and Delete.

The list itself is composed of a collection of list nodes, each of which is a struct with two members: an int and a pointer to the next node. We can define such a struct with the definition

struct list_node s { int data;

struct list_node_s*      next;

} Program 4.9: The Member function

int  Member(int value, struct list_node_s*  head_p) {

while (curr_p != NULL && curr_p ->data < value)

curr_p = curr_p ->next;

if (curr_p == NULL  j j  curr_p ->data > value) {

return 0;

} else {

return 1;

}

}  /*  Member  */

A typical list is shown in Figure 4.4. A pointer, head_p, with type struct list_node_s* refers to the first node in the list. The next member of the last node is NULL (which is indicated by a slash (/ ) in the next member).

The Member function (Program 4.9) uses a pointer to traverse the list until it either finds the desired value or determines that the desired value cannot be in the list. Since the list is sorted, the latter condition occurs when the curr p pointer is NULL or when the data member of the current node is larger than the desired value.

The Insert function (Program 4.10) begins by searching for the correct position in which to insert the new node. Since the list is sorted, it must search until it finds a node whose data member is greater than the value to be inserted. When it finds this node, it needs to insert the new node in the position preceding the node that’s been found. Since the list is singly-linked, we can’t “back up” to this position without traversing the list a second time. There are several approaches to dealing with this: the approach we use is to define a second pointer pred p, which, in general, refers to the predecessor of the current node. When we exit the loop that searches for the position to insert, the next member of the node referred to by pred p can be updated so that it refers to the new node. See Figure 4.5.

The Delete function (Program 4.11) is similar to the Insert function in that it also needs to keep track of the predecessor of the current node while it’s searching for the node to be deleted. The predecessor node’s next member can then be updated after the search is completed. See Figure 4.6.

Program 4.10: The Insert function

int Insert(int value, struct list_node_s**   head_p) {

struct list_node_s*         curr_p         =       head p;

struct list_node_s*         pred_p        =       NULL;

struct list_node_s*  temp_p;

while (curr_p != NULL && curr_p ->data < value) {

pred_p = curr_p;

curr_p = curr_p ->next;

}

if (curr_p == NULL ||  curr p ->data > value) {

temp_p = malloc(sizeof(struct list_node_s));

temp_p >data = value;

temp_p >next = curr p;

if (pred_p == NULL)     /*       New first node      */

else

pred_p >next = temp_p;

return 1;

} else { /*  Value already in list  */

return 0;

}

}  /*  Insert  */ Now let’s try to use these functions in a Pthreads program. In order to share access to the list, we can define head p to be a global variable. This will simplify the function headers for Member, Insert, and Delete, since we won’t need to pass in either head p or a pointer to head p, we’ll only need to pass in the value of interest.

int Delete(int value, struct list_node_s**   head p)        {

struct list node s   * pred_p     = NULL;

while (curr_p != NULL && curr_p ->data < value)      {

pred_p = curr_p;

curr_p = curr_p ->next;

}

if (curr_p != NULL && curr_ p ->data == value) {

if (pred_p == NULL) { /*  Deleting first node in list  */

free(curr_p);

g else f

pred_p ->next = curr_p ->next;

free(curr_p);

}

return 1;

} else {  /*  Value isn’t in list  */

return 0;

}

}  /*  Delete  */ What now are the consequences of having multiple threads simultaneously execute the three functions?

Since multiple threads can simultaneously read a memory location without con-flict, it should be clear that multiple threads can simultaneously execute Member. On the other hand, Delete and Insert also write to memory locations, so there may be problems if we try to execute either of these operations at the same time as another operation. As an example, suppose that thread 0 is executing Member (5) at the same time that thread 1 is executing Delete (5). The current state of the list is shown in Figure 4.7. An obvious problem is that if thread 0 is executing Member (5), it is going to report that 5 is in the list, when, in fact, it may be deleted even before thread 0 returns. A second obvious problem is if thread 0 is executing Member (8), thread 1 may free the memory used for the node storing 5 before thread 0 can advance to the node storing 8. Although typical implementations of free don’t overwrite the freed memory, if the memory is reallocated before thread 0 advances, there can be serious problems. For example, if the memory is reallocated for use in something other than a list node, what thread 0 “thinks” is the next member may be set to utter garbage, and after it executes

curr_p = curr_p >next;

dereferencing curr_p may result in a segmentation violation.

More generally, we can run into problems if we try to simultaneously execute another operation while we’re executing an Insert or a Delete. It’s OK for multi-ple threads to simultaneously execute Member—that is, read the list nodes—but it’s unsafe for multiple threads to access the list if at least one of the threads is executing an Insert or a Delete—that is, is writing to the list nodes (see Exercise 4.11).

How can we deal with this problem? An obvious solution is to simply lock the list any time that a thread attempts to access it. For example, a call to each of the three functions can be protected by a mutex, so we might execute

Member(value);

An equally obvious problem with this solution is that we are serializing access to the list, and if the vast majority of our operations are calls to Member, we’ll fail to exploit this opportunity for parallelism. On the other hand, if most of our operations are calls to Insert and Delete, then this may be the best solution, since we’ll need to serialize access to the list for most of the operations, and this solution will certainly be easy to implement.

An alternative to this approach involves “finer-grained” locking. Instead of lock-ing the entire list, we could try to lock individual nodes. We would add, for example, a mutex to the list node struct:

struct list_node_s { int data;

struct list_node_s*  next;

}

Now each time we try to access a node we must first lock the mutex associated with the node. Note that this will also require that we have a mutex associated with the head p pointer. So, for example, we might implement Member as shown in Program 4.12. Admittedly this implementation is much more complex than the original Member function. It is also much slower, since, in general, each time a node is accessed, a mutex must be locked and unlocked. At a minimum it will add two function calls to the node access, but it can also add a substantial delay if a thread has to wait for a lock. A further problem is that the addition of a mutex field to each node will substantially increase the amount of storage needed for the list. On the other hand, the finer-grained locking might be a closer approximation to what we want. Since we’re only locking the nodes of current interest, multiple threads can simul-taneously access different parts of the list, regardless of which operations they’re executing.

Program 4.12: Implementation of Member with one mutex per list node Using Pthreads read-write locks, we can protect our linked list functions with the following code (we’re ignoring function return values):

Member(value);

. . .

. . .

The syntax for the new Pthreads functions is

As their names suggest, the first function locks the read-write lock for reading, the second locks it for writing, and the last unlocks it.

As with mutexes, read-write locks should be initialized before use and destroyed after use. The following function can be used for initialization:

const pthread_rwlockattr_t     attr_p         /*       in       */);

Also as with mutexes, we’ll not use the second argument, so we’ll just pass NULL. The following function can be used for destruction of a read-write lock:

4. Performance of the various implementations

Of course, we really want to know which of the three implementations is “best,” so we included our implementations in a small program in which the main thread first inserts a user-specified number of randomly generated keys into an empty list. After being started by the main thread, each thread carries out a user-specified number of operations on the list. The user also specifies the percentages of each type of oper-ation (Member, Insert, Delete). However, which operation occurs when and on which key is determined by a random number generator. Thus, for example, the user might specify that 1000 keys should be inserted into an initially empty list and a total of 100,000 operations are to be carried out by the threads. Further, she might specify that 80% of the operations should be Member, 15% should be Insert, and the remaining 5% should be Delete. However, since the operations are randomly generated, it might happen that the threads execute a total of, say, 79,000 calls to Member, 15,500 calls to Insert, and 5500 calls to Delete.

Tables 4.3 and 4.4 show the times (in seconds) that it took for 100,000 operations on a list that was initialized to contain 1000 keys. Both sets of data were taken on a system containing four dual-core processors.

Table 4.3 shows the times when 99.9% of the operations are Member and the remaining 0.1% are divided equally between Insert and Delete. Table 4.4 shows the times when 80% of the operations are Member, 10% are Insert, and 10% are Delete. Note that in both tables when one thread is used, the run-times for the  read-write locks and the single-mutex implementations are about the same. This makes sense: the operations are serialized, and since there is no contention for the read-write lock or the mutex, the overhead associated with both implementations should consist of a function call before the list operation and a function call after the operation. On the other hand, the implementation that uses one mutex per node is much slower. This also makes sense, since each time a single node is accessed there will be two function calls—one to lock the node mutex and one to unlock it. Thus, there’s considerably more overhead for this implementation.

The inferiority of the implementation that uses one mutex per node persists when we use multiple threads. There is far too much overhead associated with all the locking and unlocking to make this implementation competitive with the other two implementations.

Perhaps the most striking difference between the two tables is the relative perfor-mance of the read-write lock implementation and the single-mutex implementation when multiple threads are used. When there are very few Inserts and Deletes, the read-write lock implementation is far better than the single-mutex implementation. Since the single-mutex implementation will serialize all the operations, this suggests that if there are very few Inserts and Deletes, the read-write locks do a very good job of allowing concurrent access to the list. On the other hand, if there are a relatively large number of Inserts and Deletes (for example, 10% each), there’s very little difference between the performance of the read-write lock implementation and the single-mutex implementation. Thus, for linked list operations, read-write locks can provide a considerable increase in performance, but only if the number of Inserts and Deletes is quite small.

Also notice that if we use one mutex or one mutex per node, the program is always as fast or faster when it’s run with one thread. Furthermore, when the number of inserts and deletes is relatively large, the read-write lock program is also faster with one thread. This isn’t surprising for the one mutex implemen-tation, since effectively accesses to the list are serialized. For the read-write lock implementation, it appears that when there are a substantial number of write-locks, there is too much contention for the locks and overall performance deteriorates significantly.

In summary, the read-write lock implementation is superior to the one mutex and the one mutex per node implementations. However, unless the number of inserts and deletes is small, a serial implementation will be superior.

The original Pthreads specification didn’t include read-write locks, so some of the early texts describing Pthreads include implementations of read-write locks (see, for example, ). A typical implementation6 defines a data structure that uses two condi-tion variables—one for “readers” and one for “writers”—and a mutex. The structure also contains members that indicate

how many readers are waiting to obtain the lock,

whether a writer owns the lock, and

how many writers are waiting to obtain the lock.