Pipeline Resource Starvation
If a single thread is running on a core, its performance is limited by
the rate at which the core can issue instructions and the number of stalls that
the thread encounters. For example, a core might be able to sustain only a
single load every cycle, so code containing only load instructions could issue
at most one instruction per cycle. Similarly, code consisting of a stream of
dependent load instructions could issue each instruction only once the
previous load had completed, and the time each load took would reflect the
latency of the memory system.
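The dependent-load case can be illustrated with a short pointer-chasing sketch (the `node` type and `chase` helper here are illustrative, not part of the original listings): each load supplies the address for the next, so no two loads can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dependent-load chain: the address of each load comes
   from the result of the previous load, so the loads serialize and
   each one pays the full latency of the memory system. */
struct node { struct node *next; };

static struct node *chase( struct node *p, int steps )
{
  for ( int i = 0; i < steps; i++ )
  {
    p = p->next;  /* cannot issue until the previous load completes */
  }
  return p;
}
```

A compiler cannot overlap these loads because each iteration has a true data dependence on the previous one; this is exactly the access pattern where a thread spends most of its cycles stalled.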
When multiple threads share a core, the rate at
which the threads can issue instructions is limited by two constraints. The
first constraint is the maximum rate at which the hardware can issue
instructions. For example, if there is only one load pipeline, then only one
load instruction can be issued per cycle. The second constraint is that all the
threads have to share the available hardware resources. If there is one load
pipeline and two threads, then only one thread can issue a load instruction at
a time.
It is instructive to think of
this in terms of cache misses. When one thread is stalled waiting for data to
be returned from memory, then all the other threads on that core can take
instruction issue slots from the stalled thread. If all threads are stalled
waiting for data from memory, then the issue width of the processor is not the
dominating constraint on performance. This is the ideal situation for a CMT
processor. If the threads are stalled waiting for memory, then it is more
effective to have a large number of these threads all making forward progress.
Although individual threads might run slowly, the aggregate throughput of the
system is impressive because of the number of threads.
When threads are not stalled on
memory, such as when the data is cache resident or the code is compute
intensive, the threads can become limited on instruction issue width. A single
core supporting multiple threads relies on gaps where each thread is stalled to
find cycles where an instruction from a different thread can be executed. If
there are too many compute-intensive threads running on the core, then
instruction issue width becomes a limiting factor in performance.
Consider the code shown in
Listing 9.26, which runs a number of threads, and each thread is performing a
simple set of integer operations on an item of data. Integer operations are
typically single-cycle instructions, so one thread will want to execute a
single instruction every cycle. This is achievable when there is a single
thread running on the core, but it can be limited on some processors when
multiple threads are assigned to the same core.
Listing 9.26 Worker Threads That Perform Sequences of Single-Cycle Integer
Operations
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int nthreads = 8;

void *experiment( void *id )
{
  unsigned int seed = 0;
  processor_bind( P_LWPID, P_MYID, ((int)id*2), 0 );
  int count = 100000000;
  for( int i=0; i<count; i++ )
  {
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
  }
  if ( seed == 1 ) { printf( "" ); }  /* keep seed live so the loop is not optimized away */
  return 0;
}
int main( int argc, char* argv[] )
{
pthread_t threads[64];
if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
for( int i=0; i<nthreads; i++ )
{
pthread_create( &threads[i], 0, experiment, (void*)i );
}
for( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );
}
return 0;
}
The code shown in Listing 9.26 uses the Solaris processor_bind call to bind threads to particular virtual CPUs. This is necessary
because the operating system will typically try to locate threads so that each
thread gets the maximal resources. In this instance, we do not want that to
happen because we want to investigate what happens when threads are scheduled
on the same core. Doing this correctly requires knowledge of how the virtual
CPU number maps to cores. This mapping often falls into one of two types.
One common mapping assigns virtual CPU numbers
to cores in groups. If a core supports four threads, then virtual
CPUs 0 to 3 would correspond to the four threads on the first core.
The other common mapping scheme is to interleave cores. In this case, if
there were two cores, then all the odd-numbered virtual CPUs would map to one
core and all the even-numbered virtual CPUs would map to the other core.
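The two schemes can be sketched as simple index calculations (the function names here are illustrative assumptions; real systems expose the actual mapping through OS- or library-specific interfaces):

```c
#include <assert.h>

/* Grouped mapping: virtual CPUs 0..T-1 are on core 0, T..2T-1 on
   core 1, and so on, where T is the number of threads per core. */
static int core_of_grouped( int vcpu, int threads_per_core )
{
  return vcpu / threads_per_core;
}

/* Interleaved mapping: consecutive virtual CPU numbers alternate
   across the cores, so a machine with C cores places virtual CPU v
   on core v mod C. */
static int core_of_interleaved( int vcpu, int ncores )
{
  return vcpu % ncores;
}
```

Under the interleaved mapping with two cores, virtual CPUs 0 and 2 land on one core, which is why the binding in Listing 9.26 multiplies the thread index by two.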
The code as shown is for a machine where the
virtual CPUs are interleaved. The particular Intel machine this was run on had
two hyperthreading-enabled cores, making a total of four virtual CPUs. Virtual
CPUs 0 and 2 are located on core 0, and virtual CPUs 1 and 3 are located on
core 1.
When the bit manipulation code is run with one
thread, it takes about two seconds. When run with two threads on the same core,
it runs in two and a half seconds. If the runtime had remained the same, then
twice as much work would have been achieved in the same time. However, this was
not the case—twice as much work was achieved in 25% more time. To put it
another way, this represents a 60% gain in throughput.
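The arithmetic behind that figure can be written out directly (the helper name is an assumption, used only for illustration):

```c
#include <assert.h>
#include <math.h>

/* Relative throughput is relative work divided by relative time.
   Two threads complete 2x the work in 2.5s instead of 2s, which is
   1.25x the time, so the gain is 2.0 / 1.25 = 1.6, i.e. 60%. */
static double throughput_gain( double work_ratio, double time_ratio )
{
  return work_ratio / time_ratio;
}
```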
The same code was also run on an UltraSPARC T2 machine, with the binding
suitably modified. The topology of this machine is interesting because each core
can support eight threads, and these threads are arranged into two groups of
four. Each group of four can issue one integer operation per cycle, but the two
groups share a single load/store pipeline and a single floating-point pipeline.
When run on this machine, using a single thread, the code took four seconds.
When two threads were assigned to the same group on the same core, the threads
completed in seven seconds, achieving only slightly more work per unit time
than a single thread. When the two threads are assigned to different groups on
the same core, the threads complete their work in four seconds. With this
distribution of threads, the amount of work achieved per unit time is doubled.
As more threads become active, the scaling of a
processor depends on the architecture of the core. The more available capacity,
the more the throughput of the core will increase as the number of threads
increases. An alternative view is that cores that work very hard to extract
every bit of performance from a single-threaded code will end up with fewer
spare cycles for a second thread to use. Once a core becomes fully utilized,
additional threads will steal cycles from those that are already running and
will cause all the active threads to run slightly slower.
If all the threads are busy
performing useful work, then to a large extent it does not matter how well the
processor scales—as long as it is completing more work than it was when running
with a single thread.
Where performance is potentially
lost is when one of the threads is spinning, waiting for some signal. The
spinning thread will be taking instruction issue opportunities from the other
threads running on the core. The code in Listing 9.27 demonstrates how a
spinning thread can detract from the performance of the worker thread. This
code creates multiple spinning threads and then attempts to complete a
predefined amount of work.
Listing 9.27 Spinning Threads
That Take Instruction Issue Width from the Thread Performing the Work
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int nthreads = 8;
volatile int busy = 1;

void * spin( void *id )
{
  processor_bind( P_LWPID, P_MYID, (int)id, 0 );
  while ( busy ) {}
  return 0;
}
void experiment( int id )
{
  unsigned int seed = 0;
  processor_bind( P_LWPID, P_MYID, id, 0 );
  int count = 100000000;
  for( int i=0; i<count; i++ )
  {
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
  }
  if ( seed == 1 ) { printf( "" ); }
}
int main( int argc, char* argv[] )
{
pthread_t threads[64];
if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
for( int i=0; i<nthreads; i++ )
{
pthread_create( &threads[i], 0, spin, (void*)(i+1) );
}
experiment( 0 );
busy = 0;
for( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );
}
return 0;
}
On a test machine, the code takes four seconds to complete when there
are no other threads scheduled to the same core. With one spinning thread
scheduled to the same core, the code takes five and a half seconds to complete.
With two threads it takes 8 seconds, and with three spinning threads it takes
12 seconds.
The progression of runtimes is interesting. The
first spinning thread slows the original thread by only about 25% because the
spinning thread mainly performs load and branch instructions, both of which
have a latency of a few cycles before the thread is again ready to execute a
new instruction. The net result is that the first spinning thread wants to issue
an instruction every four cycles, which leaves three cycles for the original
thread to execute instructions. The second spinning thread follows the same
pattern, so now there are two threads that want to issue an instruction every
four cycles. This leaves the original thread able to issue an instruction
every other cycle, so it runs twice as slowly. When the third spinning thread
is added, the core is already fully occupied by running the original thread and
the two spinning threads, so adding an additional thread takes the situation
from one where the three threads are getting to issue about one instruction
every three cycles to a situation where they get to issue an instruction every
four cycles. This is a 33% drop in issue rate, which takes the runtime from 8 to 12
seconds.
One way of reducing the impact of
spinning threads is to cause the thread to pause and consume no instruction
issue resources. Most processors have a long-latency instruction that can be
used for this purpose even if they do not have an explicit instruction for
delaying.
For example, on SPARC processors the operation to
read the tick register takes a few cycles. The modified spin function shown in
Listing 9.28 uses this to insert pauses into the spin cycle.
Listing 9.28 Reading
the Tick Register to Insert Delays into the Spin Code
void * spin( void *id )
{
  processor_bind( P_LWPID, P_MYID, (int)id, 0 );
  while( busy )
  {
    asm volatile ( "rd %%tick, %%o0" : : : "%o0" );
  }
  return 0;
}
Inserting this delay takes the runtime of the code from 12 seconds with
three spinning threads down to 7 seconds. Repeating the same instruction a
couple of times or identifying a longer latency operation could be used to
further reduce the impact of spinning threads.
Recent x86 processors implement a pause instruction, which causes the thread to pause for a number of cycles
before issuing the next instruction. On older processors, the pause instruction maps onto a no-op, so using the instruction is safe on older
hardware, although that hardware will not actually pause. Windows provides a YieldProcessor() macro for pause/delay to give an easy way of accessing this
operation.
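As a sketch, the spin function from Listing 9.27 might be written for x86 with GCC or Clang using the `_mm_pause()` intrinsic, which emits the pause instruction (this x86 variant is an illustration, not code from the original text):

```c
#include <immintrin.h>

volatile int busy = 1;

/* Spin loop that yields issue slots while waiting: pause briefly
   stalls this hardware thread, leaving cycles for the threads
   sharing the core. On processors that predate pause, the same
   encoding decodes as a no-op. */
void spin_wait( void )
{
  while ( busy )
  {
    _mm_pause();
  }
}
```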