Synchronization and Various Hardware Primitives

1. Synchronization
Synchronization mechanisms are typically built with
user-level software routines that rely on hardware-supplied synchronization
instructions. Efficient spin locks can be built using a simple hardware synchronization instruction together with the coherence mechanism.
2. Basic Hardware Primitives
The key ability we require to implement
synchronization in a multiprocessor is a set of hardware primitives with the
ability to atomically read and modify a memory location. Without such a
capability, the cost of building basic synchronization primitives will be too
high and will increase as the processor count increases.
There are a number of alternative formulations of
the basic hardware primitives, all of which provide the ability to atomically
read and modify a location, together with some way to tell if the read and
write were performed atomically.
These hardware primitives are the basic building
blocks that are used to build a wide variety of user-level synchronization
operations, including things such as locks and barriers.
One typical operation for building synchronization
operations is the atomic exchange, which interchanges a value in a register for
a value in memory.
To see how to use this to build a basic synchronization operation, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and a 1 is used to indicate that the lock is unavailable.
A
processor tries to set the lock by doing an exchange of 1, which is in a
register, with the memory address corresponding to the lock. The value returned
from the exchange instruction is 1 if some other processor had already claimed
access and 0 otherwise. In the latter case, the value is also changed to be 1,
preventing any competing exchange from also retrieving a 0.
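As a concrete illustration (the MIPS-style code sequence appears in the next section), a minimal C11-style sketch of this lock is shown below; the names lock_var, acquire, and release are purely illustrative:

#include <stdatomic.h>

static atomic_int lock_var = 0;        /* 0 = free, 1 = unavailable */

void acquire(void)
{
    /* exchange 1 with the lock; a returned 0 means we claimed it */
    while (atomic_exchange(&lock_var, 1) != 0)
        ;                              /* someone else already holds the lock */
}

void release(void)
{
    atomic_store(&lock_var, 0);        /* store 0 to free the lock */
}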
There are a number of other atomic primitives that
can be used to implement synchronization. They all have the key property that
they read and update a memory value in such a manner that we can tell whether
or not the two operations executed atomically. One operation, present in many
older multiprocessors, is test-and-set, which tests a value and sets it if the
value passes the test. For example, we could define an operation that tested
for 0 and set the value to 1, which can be used in a fashion similar to how we
used atomic exchange.
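For instance, C11's atomic_flag operations behave like such a test-and-set, testing for clear and setting the flag; a minimal sketch of a lock built on it (the names are ours):

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* clear = lock is free */

void ts_acquire(void)
{
    /* test-and-set returns the old value; spin while it was already set */
    while (atomic_flag_test_and_set(&lock_flag))
        ;
}

void ts_release(void)
{
    atomic_flag_clear(&lock_flag);                  /* lock passes the test again */
}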
Another atomic synchronization primitive is
fetch-and-increment: it returns the value of a memory location and atomically
increments it. By using the value 0 to indicate that the synchronization
variable is unclaimed, we can use fetch-and-increment, just as we used
exchange. There are other uses of operations like fetch-and-increment.
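A minimal sketch of that use in the same C style, where 0 marks the variable as unclaimed and only the processor whose increment returned 0 has claimed it (names are illustrative):

#include <stdatomic.h>

static atomic_uint sync_var = 0;       /* 0 means unclaimed */

void fai_acquire(void)
{
    /* fetch-and-increment returns the old value and bumps the location;
       only the processor that saw 0 has claimed the variable */
    while (atomic_fetch_add(&sync_var, 1) != 0)
        ;                              /* someone else got there first */
}

void fai_release(void)
{
    atomic_store(&sync_var, 0);        /* mark the variable unclaimed again */
}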
3. Implementing Locks Using Coherence
We can use the coherence mechanisms of a multiprocessor
to implement spin locks: locks that a processor continuously tries to acquire,
spinning around a loop until it succeeds. Spin locks are used when a programmer
expects the lock to be held for a very short amount of time and when she wants
the process of locking to be low latency when the lock is available. Because
spin locks tie up the processor, waiting in a loop for the lock to become free,
they are inappropriate in some circumstances.
The simplest implementation, which we would use if there were no cache coherence, would keep the lock variables in memory. A processor could continually try to acquire the lock using an atomic operation, say exchange, and test whether the exchange returned the lock as free. To release the lock, the processor simply stores the value 0 to the lock. Here is the code sequence to lock a spin lock whose address is in R1 using an atomic exchange:
        DADDUI R2,R0,#1
lockit: EXCH   R2,0(R1)     ; atomic exchange
        BNEZ   R2,lockit    ; already locked?
If our multiprocessor supports cache coherence, we can cache the locks using the coherence mechanism to maintain the lock value coherently. Caching locks has two advantages.

First, it allows an implementation where the process of “spinning” (trying to test and acquire the lock in a tight loop) could be done on a local cached copy rather than requiring a global memory access on each attempt to acquire the lock.
The second advantage comes from the observation that there is often locality in lock accesses: that is, the processor that used the lock last will use it again in the near future. In such cases, the lock value may reside in the cache of that processor, greatly reducing the time to acquire the lock.
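One way to exploit both advantages is to spin on an ordinary read of the locally cached copy and attempt the atomic exchange only when the lock appears free, the scheme commonly called test and test-and-set. A hedged C11 sketch:

#include <stdatomic.h>

void coherent_acquire(atomic_int *lock)
{
    for (;;) {
        /* spin on the cached copy: plain loads generate no bus traffic
           while the lock value stays in the local cache */
        while (atomic_load(lock) != 0)
            ;
        /* the lock looked free; now race for it with an atomic exchange */
        if (atomic_exchange(lock, 1) == 0)
            return;                     /* we won the race */
    }
}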
4. Synchronization Performance Challenges
Barrier Synchronization
One additional common synchronization operation in
programs with parallel loops is a barrier. A barrier forces all processes to
wait until all the processes reach the barrier and then releases all of the
processes. A typical implementation of a barrier can be done with two spin
locks: one used to protect a counter that tallies the processes arriving at
the barrier and one used to hold the processes until the last process arrives
at the barrier.
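For concreteness, a minimal C-style sketch of this two-part scheme (a spin lock guarding the counter plus a release flag that holds processes) might look as follows; the names and the process count TOTAL are illustrative:

#include <stdatomic.h>

#define TOTAL 8                        /* number of participating processes (assumed) */

static atomic_int counterlock = 0;     /* spin lock protecting the counter          */
static atomic_int count = 0;           /* how many processes have arrived           */
static atomic_int release_flag = 0;    /* holds processes until the last one arrives */

static void lock(atomic_int *l)   { while (atomic_exchange(l, 1)) ; }
static void unlock(atomic_int *l) { atomic_store(l, 0); }

void barrier(void)
{
    lock(&counterlock);                        /* make the counter update atomic   */
    if (atomic_load(&count) == 0)
        atomic_store(&release_flag, 0);        /* first arrival resets the release */
    int arrived = atomic_fetch_add(&count, 1) + 1;
    unlock(&counterlock);

    if (arrived == TOTAL) {                    /* last process: reset and release  */
        atomic_store(&count, 0);
        atomic_store(&release_flag, 1);
    } else {
        while (atomic_load(&release_flag) == 0)
            ;                                  /* spin until the last process arrives */
    }
}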
Synchronization Mechanisms for Larger-Scale Multiprocessors
Software Implementations
The major difficulty with our spin-lock
implementation is the delay due to contention when many processes are spinning
on the lock. One solution is to artificially delay processes when they fail to
acquire the lock. The best performance is obtained by increasing the delay
exponentially whenever the attempt to acquire the lock fails.
Figure 3.7 shows how a spin lock with exponential back-off is implemented. Exponential back-off is a common technique for reducing contention in shared resources, including access to shared networks and buses. This implementation still
including access to shared networks and buses. This implementation still
attempts to preserve low latency when contention is small by not delaying the
initial spin loop. The result is that if many processes are waiting, the
back-off does not affect the processes on their first attempt to acquire the
lock. We could also delay that process, but the result would be poorer
performance when the lock was in use by only two processes and the first one
happened to find it locked.
        DADDUI R3,R0,#1     ; R3 = initial delay
lockit: LL     R2,0(R1)     ; load linked
        BNEZ   R2,lockit    ; not available - spin
        DADDUI R2,R2,#1     ; get locked value
        SC     R2,0(R1)     ; store conditional
        BNEZ   R2,gotit     ; branch if store succeeds
        DSLL   R3,R3,#1     ; increase delay by factor of 2
        PAUSE  R3           ; delay by value in R3
        J      lockit
gotit:  use data protected by lock
FIGURE 3.7 A spin lock with exponential back-off.
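The same idea can be sketched at a higher level in C; here the hypothetical pause_for routine stands in for the PAUSE instruction, and the back-off again applies only after a failed attempt to claim the lock:

#include <stdatomic.h>

/* Illustrative stand-in for a hardware pause: busy-wait for roughly `delay` iterations. */
static void pause_for(int delay)
{
    for (volatile int i = 0; i < delay; i++)
        ;
}

void spin_lock_backoff(atomic_int *lock)
{
    int delay = 1;                                   /* initial delay */
    for (;;) {
        while (atomic_load(lock) != 0)
            ;                                        /* spin on the cached copy */
        int expected = 0;
        if (atomic_compare_exchange_weak(lock, &expected, 1))
            return;                                  /* acquired the lock */
        pause_for(delay);
        delay <<= 1;                                 /* increase delay by a factor of 2 */
    }
}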
Another technique for implementing locks is to use queuing locks. Queuing locks work by constructing a queue of waiting processors; whenever a processor frees up the lock, it causes the next processor in the queue to attempt access. This eliminates contention for a lock when it is freed. We show how queuing locks operate in the next section using a hardware implementation, but software implementations using arrays can achieve most of the same benefits.
5. Hardware Primitives
In this section we look at two hardware
synchronization primitives. The first primitive deals with locks, while the
second is useful for barriers and a number of other user-level operations that
require counting or supplying distinct indices. In both cases we can create a
hardware primitive where latency is essentially identical to our earlier
version, but with much less serialization, leading to better scaling when there
is contention.
The major problem with our original lock
implementation is that it introduces a large amount of unneeded contention. For
example, when the lock is released all processors generate both a read and a
write miss, although at most one processor can successfully get the lock in
the unlocked state. This sequence happens on each of the 20 lock/unlock
sequences.
We can improve this situation by explicitly handing
the lock from one waiting processor to the next. Rather than simply allowing
all processors to compete every time the lock is released, we keep a list of
the waiting processors and hand the lock to one explicitly, when its turn
comes. This sort of mechanism has been called a queuing lock. Queuing locks can
be implemented either in hardware, or in software using an array to keep track
of the waiting processes.
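As a sketch of the software variant, an array-based queuing lock can be built from fetch-and-increment: each waiter obtains a distinct slot in the array and spins only on its own flag. The structure and names below are illustrative; in practice each flag would be padded to its own cache line:

#include <stdatomic.h>

#define MAX_PROCS 64                       /* assumed maximum number of waiters */

typedef struct {
    atomic_int  flag[MAX_PROCS];           /* flag[i] == 1 means slot i may enter */
    atomic_uint next_slot;                 /* next queue position to hand out     */
} queue_lock_t;

void queue_lock_init(queue_lock_t *q)
{
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_store(&q->flag[i], 0);
    atomic_store(&q->flag[0], 1);          /* the first arrival gets the lock */
    atomic_store(&q->next_slot, 0);
}

unsigned queue_lock_acquire(queue_lock_t *q)
{
    /* fetch-and-increment hands each waiter a distinct slot */
    unsigned my_slot = atomic_fetch_add(&q->next_slot, 1) % MAX_PROCS;
    while (atomic_load(&q->flag[my_slot]) == 0)
        ;                                  /* spin only on my own flag */
    atomic_store(&q->flag[my_slot], 0);    /* reset the slot for later reuse */
    return my_slot;                        /* needed to release the lock */
}

void queue_lock_release(queue_lock_t *q, unsigned my_slot)
{
    /* hand the lock explicitly to the next waiter in the queue */
    atomic_store(&q->flag[(my_slot + 1) % MAX_PROCS], 1);
}

Because each processor spins on a different location, releasing the lock disturbs exactly one waiter rather than all of them.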