Synchronization and Various Hardware Primitives
Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. Efficient spin locks can be built using a simple hardware synchronization instruction together with the coherence mechanism.
2. Basic Hardware Primitives
The key ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. Without such a capability, the cost of building basic synchronization primitives will be too high and will increase as the processor count increases.
There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically.
These hardware primitives are the basic building blocks that are used to build a wide variety of user-level synchronization operations, including things such as locks and barriers.
One typical operation for building synchronization operations is the atomic exchange, which interchanges a value in a register for a value in memory.
To see how to use this primitive to build a basic synchronization operation, assume that we want to build a simple lock where the value 0 indicates that the lock is free and 1 indicates that the lock is unavailable.
A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access and 0 otherwise. In the latter case, the value is also changed to be 1, preventing any competing exchange from also retrieving a 0.
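The exchange-based lock described above can be sketched in C11 rather than assembly; this is an illustrative sketch using `stdatomic.h`, not the text's own implementation. The lock word holds 0 when free and 1 when held, exactly as described.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* Exchange 1 into the lock word; a returned 0 means the lock was
     * free and we now own it. A returned 1 means another processor
     * already claimed it, so keep trying. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;  /* spin */
}

static void spin_unlock(spinlock_t *l) {
    /* Releasing the lock is a plain store of 0. */
    atomic_store(&l->locked, 0);
}
```

Note that because the exchange writes 1 unconditionally, a competing exchange that loses the race still reads 1 and retries, just as the text explains.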
There are a number of other atomic primitives that can be used to implement synchronization. They all have the key property that they read and update a memory value in such a manner that we can tell whether or not the two operations executed atomically. One operation, present in many older multiprocessors, is test-and-set, which tests a value and sets it if the value passes the test. For example, we could define an operation that tested for 0 and set the value to 1, which can be used in a fashion similar to how we used atomic exchange.
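Test-and-set maps directly onto C11's `atomic_flag`; the sketch below uses that standard facility to show the "test for 0, set to 1" behavior described above (the helper names are ours).

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

static bool try_acquire(void) {
    /* atomic_flag_test_and_set atomically sets the flag and returns
     * its previous value: false means the test "passed" (the flag was
     * clear), so we acquired the lock. */
    return !atomic_flag_test_and_set(&lock_flag);
}

static void release(void) {
    atomic_flag_clear(&lock_flag);
}
```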
Another atomic synchronization primitive is fetch-and-increment: it returns the value of a memory location and atomically increments it. By using the value 0 to indicate that the synchronization variable is unclaimed, we can use fetch-and-increment, just as we used exchange. There are other uses of operations like fetch-and-increment.
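Fetch-and-increment corresponds to C11's `atomic_fetch_add`. The sketch below shows the property that makes it useful beyond locking: concurrent callers each receive a distinct index, which is exactly what a barrier counter or a work distributor needs.

```c
#include <stdatomic.h>

static atomic_int next_index = 0;

static int fetch_and_increment(void) {
    /* Returns the old value and atomically increments the location. */
    return atomic_fetch_add(&next_index, 1);
}
```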
3. Implementing Locks Using Coherence
We can use the coherence mechanisms of a multiprocessor to implement spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds. Spin locks are used when a programmer expects the lock to be held for a very short amount of time and when she wants the process of locking to be low latency when the lock is available. Because spin locks tie up the processor, waiting in a loop for the lock to become free, they are inappropriate in some circumstances.
The simplest implementation, which we would use if there were no cache coherence, would keep the lock variables in memory. A processor could continually try to acquire the lock using an atomic operation, say exchange, and test whether the exchange returned the lock as free. To release the lock, the processor simply stores the value 0 to the lock. Here is the code sequence to lock a spin lock whose address is in R1 using an atomic exchange:
        DADDUI R2,R0,#1  ; load locked value (1) into R2
lockit: EXCH   R2,0(R1)  ; atomic exchange
        BNEZ   R2,lockit ; already locked?
If our multiprocessor supports cache coherence, we can cache the locks using the
coherence mechanism to maintain the lock value coherently. Caching locks has two advantages.
First, it allows an implementation where the process of “spinning” (trying to test and acquire the lock in a tight loop) could be done on a local cached copy rather than requiring a global memory access on each attempt to acquire the lock.
The second advantage comes from the observation that there is often locality in lock accesses: that is, the processor that used the lock last will use it again in the near future. In such cases, the lock value may reside in the cache of that processor, greatly reducing the time to acquire the lock.
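Spinning on a local cached copy corresponds to the test-and-test-and-set idiom, sketched below in C11 (an illustration, not the text's own code): spin with plain loads, which hit in the local cache once the line is present, and attempt the expensive atomic exchange only when the lock is observed to be free.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock_t;

static void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* Phase 1: read-only spin on the (cached) lock value; this
         * misses only when the line is invalidated by a release. */
        while (atomic_load(&l->locked) != 0)
            ;
        /* Phase 2: the lock looked free -- try to claim it atomically. */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;  /* acquired */
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    atomic_store(&l->locked, 0);
}
```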
4. Synchronization Performance Challenges
One additional common synchronization operation in programs with parallel loops is a barrier. A barrier forces all processes to wait until all the processes reach the barrier and then releases all of the processes. A typical implementation of a barrier can be done with two spin locks: one used to protect a counter that tallies the processes arriving at the barrier and one used to hold the processes until the last process arrives at the barrier.
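A centralized barrier of this kind can be sketched in C11; here fetch-and-increment protects the counter directly rather than using a separate spin lock, and a sense flag (an addition of this sketch, not part of the text's two-lock scheme) lets the last arrival release the waiters and makes the barrier safely reusable.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;  /* number of processes that have arrived */
    atomic_int sense;  /* flipped by the last arrival to release all */
    int total;         /* number of participating processes */
} barrier_t;

static void barrier_wait(barrier_t *b) {
    int my_sense = !atomic_load(&b->sense);
    if (atomic_fetch_add(&b->count, 1) == b->total - 1) {
        /* Last arrival: reset the counter and release everyone. */
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, my_sense);
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;  /* spin until the last process arrives */
    }
}
```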
Synchronization Mechanisms for Larger-Scale Multiprocessors
The major difficulty with our spin-lock implementation is the delay due to contention when many processes are spinning on the lock. One solution is to artificially delay processes when they fail to acquire the lock. The best performance is obtained by increasing the delay exponentially whenever the attempt to acquire the lock fails.
Figure 3.7 shows how a spin lock with exponential back-off is implemented. Exponential back-off is a common technique for reducing contention in shared resources, including access to shared networks and buses. This implementation still attempts to preserve low latency when contention is small by not delaying the initial spin loop. The result is that if many processes are waiting, the back-off does not affect the processes on their first attempt to acquire the lock. We could also delay that first attempt, but the result would be poorer performance when the lock was in use by only two processes and the first one happened to find it locked.
        ADDUI  R3,R0,#1   ; R3 = initial delay
lockit: LL     R2,0(R1)   ; load linked
        BNEZ   R2,lockit  ; not available-spin
        DADDUI R2,R2,#1   ; get locked value
        SC     R2,0(R1)   ; store conditional
        BNEZ   R2,gotit   ; branch if store succeeds
        DSLL   R3,R3,#1   ; increase delay by factor of 2
        PAUSE  R3         ; delays by value in R3
        J      lockit     ; try again after the delay
gotit:  use data protected by lock
FIGURE 3.7 A spin lock with exponential back-off.
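The back-off scheme of Figure 3.7 can be re-expressed in C11 as a sketch. `cpu_pause()` stands in for the PAUSE instruction and is a hypothetical helper (on x86 one might use the pause intrinsic; here it is a simple delay loop). As in the figure, the initial spin loop is not delayed; back-off kicks in only after a failed acquisition attempt.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the PAUSE instruction: burn roughly
 * `delay` iterations. */
static void cpu_pause(int delay) {
    for (volatile int i = 0; i < delay; i++)
        ;
}

static void backoff_lock(atomic_int *lock) {
    int delay = 1;  /* initial delay, as in R3 */
    for (;;) {
        while (atomic_load(lock) != 0)
            ;  /* initial spin loop: no delay on this path */
        if (atomic_exchange(lock, 1) == 0)
            return;  /* acquisition succeeded */
        cpu_pause(delay);
        delay <<= 1;  /* increase delay by a factor of 2 */
    }
}

static void backoff_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```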
Another technique for implementing locks is to use queuing locks. Queuing locks work by constructing a queue of waiting processors; whenever a processor frees up the lock, it causes the next processor in the queue to attempt access. This eliminates contention for a lock when it is freed. We show how queuing locks operate in the next section using a hardware implementation, but software implementations using arrays can achieve most of the same benefits.
5. Hardware Primitives
In this section we look at two hardware synchronization primitives. The first primitive deals with locks, while the second is useful for barriers and a number of other user-level operations that require counting or supplying distinct indices. In both cases we can create a hardware primitive where latency is essentially identical to our earlier version, but with much less serialization, leading to better scaling when there is contention.
The major problem with our original lock implementation is that it introduces a large amount of unneeded contention. For example, when the lock is released all processors generate both a read and a write miss, although at most one processor can successfully get the lock in the unlocked state. This sequence happens on each of the 20 lock/unlock sequences.
We can improve this situation by explicitly handing the lock from one waiting processor to the next. Rather than simply allowing all processors to compete every time the lock is released, we keep a list of the waiting processors and hand the lock to one explicitly, when its turn comes. This sort of mechanism has been called a queuing lock. Queuing locks can be implemented either in hardware, or in software using an array to keep track of the waiting processes.
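The software array variant can be sketched as follows; `MAX_THREADS` and the slot layout are assumptions of this sketch, which only behaves correctly when no more than `MAX_THREADS` processes contend at once. Each waiter obtains a slot with fetch-and-increment and spins on its own array entry, so a release touches exactly one waiter's cache line and hands the lock over explicitly.

```c
#include <stdatomic.h>

#define MAX_THREADS 8  /* assumed maximum number of contenders */

typedef struct {
    atomic_int can_go[MAX_THREADS];  /* one permission flag per slot */
    atomic_int next_ticket;          /* fetch-and-increment dispenser */
} queue_lock_t;

static int qlock_acquire(queue_lock_t *q) {
    int my_slot = atomic_fetch_add(&q->next_ticket, 1) % MAX_THREADS;
    while (atomic_load(&q->can_go[my_slot]) == 0)
        ;  /* spin only on our own slot */
    atomic_store(&q->can_go[my_slot], 0);  /* consume the permission */
    return my_slot;
}

static void qlock_release(queue_lock_t *q, int my_slot) {
    /* Hand the lock explicitly to the next slot in the queue. */
    atomic_store(&q->can_go[(my_slot + 1) % MAX_THREADS], 1);
}
```

On initialization, slot 0 must hold the permission flag so the first acquirer succeeds immediately.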