Pipeline Resource Starvation
If a single thread is running on a core, its performance is limited by
the rate at which the core can issue instructions and the number of stalls that
the thread encounters. For example, a core might be able to sustain only a
single load every cycle, so code containing only load instructions could issue
at most one instruction per cycle. Similarly, code consisting of a stream of
dependent load instructions could issue each instruction only once the
previous load had completed, and the time each load took would reflect the
latency of the memory system.
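The dependent-load case can be illustrated with a short pointer-chasing sketch (the `node` type and `chase` helper here are illustrative, not part of the original listings): each load supplies the address for the next, so no two loads can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dependent-load chain: the address of each load comes
   from the result of the previous load, so the loads serialize and
   each one pays the full latency of the memory system. */
struct node { struct node *next; };

static struct node *chase( struct node *p, int steps )
{
  for ( int i = 0; i < steps; i++ )
  {
    p = p->next;  /* cannot issue until the previous load completes */
  }
  return p;
}
```

A compiler cannot overlap these loads because each iteration has a true data dependence on the previous one; this is exactly the access pattern where a thread spends most of its cycles stalled.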
When multiple threads share a core, the rate at
which the threads can issue instructions is limited by two constraints. The
first constraint is the maximum rate at which the hardware can issue
instructions. For example, if there is only one load pipeline, then only one
load instruction can be issued per cycle. The second constraint is that all the
threads have to share the available hardware resources. If there is one load
pipeline and two threads, then only one thread can issue a load instruction at
a time.
It is instructive to think of
this in terms of cache misses. When one thread is stalled waiting for data to
be returned from memory, then all the other threads on that core can take
instruction issue slots from the stalled thread. If all threads are stalled
waiting for data from memory, then the issue width of the processor is not the
dominating constraint on performance. This is the ideal situation for a CMT
processor. If the threads are stalled waiting for memory, then it is more
effective to have a large number of these threads all making forward progress.
Although individual threads might run slowly, the aggregate throughput of the
system is impressive because of the number of threads.
When threads are not stalled on
memory, such as when the data is cache resident or the code is compute
intensive, the threads can become limited on instruction issue width. A single
core supporting multiple threads relies on gaps where each thread is stalled to
find cycles where an instruction from a different thread can be executed. If
there are too many compute-intensive threads running on the core, then
instruction issue width becomes a limiting factor in performance.
Consider the code shown in
Listing 9.26, which runs a number of threads, and each thread is performing a
simple set of integer operations on an item of data. Integer operations are
typically single-cycle instructions, so one thread will want to execute a
single instruction every cycle. This is achievable when there is a single
thread running on the core, but it can be limited on some processors when
multiple threads are assigned to the same core.
Listing 9.26 Worker Threads That Perform Sequences of Single-Cycle Integer
Operations
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int nthreads = 8;

void *experiment( void *id )
{
  unsigned int seed = 0;
  processor_bind( P_LWPID, P_MYID, ((int)id*2), 0 );
  int count = 100000000;
  for( int i=0; i<count; i++ )
  {
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
  }
  if ( seed == 1 ) { printf( "" ); }  /* keep seed live so the loop is not optimized away */
  return 0;
}
int main( int argc, char* argv[] )
{
pthread_t threads[64];
if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
for( int i=0; i<nthreads; i++ )
{
pthread_create( &threads[i], 0, experiment, (void*)i );
}
for( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );
}
return 0;
}
The code shown in Listing 9.26 uses the Solaris processor_bind call to bind threads to particular virtual CPUs. This is necessary
because the operating system will typically try to locate threads so that each
thread gets the maximal resources. In this instance, we do not want that to
happen because we want to investigate what happens when threads are scheduled
on the same core. Doing this correctly requires knowledge of how the virtual
CPU number maps to cores. This mapping often falls into one of two types.
One common mapping assigns virtual CPU numbers
to cores in groups. If a core supports four threads, then virtual
CPUs 0 to 3 would correspond to the four threads on the first core.
The other common mapping scheme is to interleave cores. In this case, if
there were two cores, then all the odd-numbered virtual CPUs would map to one
core and all the even-numbered virtual CPUs would map to the other core.
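The two schemes can be sketched as simple index calculations (the function names here are illustrative assumptions; real systems expose the actual mapping through OS- or library-specific interfaces):

```c
#include <assert.h>

/* Grouped mapping: virtual CPUs 0..T-1 are on core 0, T..2T-1 on
   core 1, and so on, where T is the number of threads per core. */
static int core_of_grouped( int vcpu, int threads_per_core )
{
  return vcpu / threads_per_core;
}

/* Interleaved mapping: consecutive virtual CPU numbers alternate
   across the cores, so a machine with C cores places virtual CPU v
   on core v mod C. */
static int core_of_interleaved( int vcpu, int ncores )
{
  return vcpu % ncores;
}
```

Under the interleaved mapping with two cores, virtual CPUs 0 and 2 land on one core, which is why the binding in Listing 9.26 multiplies the thread index by two.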
The code as shown is for a machine where the
virtual CPUs are interleaved. The particular Intel machine this was run on had
two hyperthreading-enabled cores, making a total of four virtual CPUs. Virtual
CPUs 0 and 2 are located on core 0, and virtual CPUs 1 and 3 are located on
core 1.
When the bit manipulation code is run with one
thread, it takes about two seconds. When run with two threads on the same core,
it runs in two and a half seconds. If the runtime had remained the same, then
twice as much work would have been achieved in the same time. However, this was
not the case—twice as much work was achieved in 25% more time. To put it
another way, this represents a 60% gain in throughput.
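The arithmetic behind that figure can be written out directly (the helper name is an assumption, used only for illustration):

```c
#include <assert.h>
#include <math.h>

/* Relative throughput is relative work divided by relative time.
   Two threads complete 2x the work in 2.5s instead of 2s, which is
   1.25x the time, so the gain is 2.0 / 1.25 = 1.6, i.e. 60%. */
static double throughput_gain( double work_ratio, double time_ratio )
{
  return work_ratio / time_ratio;
}
```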
The same code was also run on an UltraSPARC T2 machine, with the binding
suitably modified. The topology of this machine is interesting because each core
can support eight threads, and these threads are arranged into two groups of
four. Each group of four can issue one integer operation per cycle, but the two
groups share a single load/store pipeline and a single floating-point pipeline.
When run on this machine, using a single thread, the code took four seconds.
When two threads were assigned to the same group on the same core, the threads
completed in seven seconds, achieving only slightly more work per unit time
than a single thread. When the two threads are assigned to different groups on
the same core, the threads complete their work in four seconds. With this
distribution of threads, the amount of work achieved per unit time is doubled.
As more threads become active, the scaling of a
processor depends on the architecture of the core. The more available capacity,
the more the throughput of the core will increase as the number of threads
increases. An alternative view is that cores that work very hard to extract
every bit of performance from a single-threaded code will end up with fewer
spare cycles for a second thread to use. Once a core becomes fully utilized,
additional threads will steal cycles from those that are already running and
will cause all the active threads to run slightly slower.
If all the threads are busy
performing useful work, then to a large extent it does not matter how well the
processor scales—as long as it is completing more work than it was when running
with a single thread.
Where performance is potentially
lost is when one of the threads is spinning, waiting for some signal. The
spinning thread will be taking instruction issue opportunities from the other
threads running on the core. The code in Listing 9.27 demonstrates how a
spinning thread can detract from the performance of the worker thread. This
code creates multiple spinning threads and then attempts to complete a
predefined amount of work.
Listing 9.27 Spinning Threads
That Take Instruction Issue Width from the Thread Performing the Work
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

int nthreads = 8;
volatile int busy = 1;

void * spin( void *id )
{
  processor_bind( P_LWPID, P_MYID, (int)id, 0 );
  while ( busy ) {}
  return 0;
}
void experiment( int id )
{
  unsigned int seed = 0;
  processor_bind( P_LWPID, P_MYID, id, 0 );
  int count = 100000000;
  for( int i=0; i<count; i++ )
  {
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
    seed += (seed<<2) ^ (seed|33556);
  }
  if ( seed == 1 ) { printf( "" ); }
}
int main( int argc, char* argv[] )
{
pthread_t threads[64];
if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
for( int i=0; i<nthreads; i++ )
{
pthread_create( &threads[i], 0, spin, (void*)(i+1) );
}
experiment( 0 );
busy = 0;
for( int i=0; i<nthreads; i++ )
{
pthread_join( threads[i], 0 );
}
return 0;
}
On a test machine, the code takes four seconds to complete when there
are no other threads scheduled to the same core. With one spinning thread
scheduled to the same core, the code takes five and a half seconds to complete.
With two threads it takes 8 seconds, and with three spinning threads it takes
12 seconds.
The progression of runtimes is interesting. The
first spinning thread slows the original thread by only about 25% because the
spinning thread mainly performs load and branch instructions, both of which
have a latency of a few cycles before the thread is again ready to execute a
new instruction. The net result is that the first spinning thread wants to issue
an instruction every four cycles, which leaves three cycles for the original
thread to execute instructions. The second spinning thread follows the same
pattern, so now there are two threads that want to issue an instruction every
four cycles. This leaves the original thread able to issue an instruction
every other cycle, so it runs twice as slowly. When the third spinning thread
is added, the core is already fully occupied by running the original thread and
the two spinning threads, so adding an additional thread takes the situation
from one where the three threads are getting to issue about one instruction
every three cycles to a situation where they get to issue an instruction every
four cycles. This is a 33% drop in issue rate, which takes the runtime from 8 to 12
seconds.
One way of reducing the impact of
spinning threads is to cause the thread to pause and consume no instruction
issue resources. Most processors have a long-latency instruction that can be
used for this purpose even if they do not have an explicit instruction for
delaying.
For example, on SPARC processors the operation to
read the tick register takes a few cycles. The modified spin function shown in
Listing 9.28 uses this to insert pauses into the spin cycle.
Listing 9.28 Reading
the Tick Register to Insert Delays into the Spin Code
void * spin( void *id )
{
  processor_bind( P_LWPID, P_MYID, (int)id, 0 );
  while( busy )
  {
    asm volatile ( "rd %%tick, %%o0" : : : "%o0" );
  }
  return 0;
}
Inserting this delay takes the runtime of the code from 12 seconds with
three spinning threads down to 7 seconds. Repeating the same instruction a
couple of times or identifying a longer latency operation could be used to
further reduce the impact of spinning threads.
Recent x86 processors implement a pause instruction, which causes the thread to pause for a number of cycles
before issuing the next instruction. On older processors, the pause instruction maps onto a no-op, so using the instruction is safe on older
hardware, although that hardware will not actually pause. Windows provides a YieldProcessor() macro for pause/delay to give an easy way of accessing this
operation.
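As a sketch, the spin function from Listing 9.27 might be written for x86 with GCC or Clang using the `_mm_pause()` intrinsic, which emits the pause instruction (this x86 variant is an illustration, not code from the original text):

```c
#include <immintrin.h>

volatile int busy = 1;

/* Spin loop that yields issue slots while waiting: pause briefly
   stalls this hardware thread, leaving cycles for the threads
   sharing the core. On processors that predate pause, the same
   encoding decodes as a no-op. */
void spin_wait( void )
{
  while ( busy )
  {
    _mm_pause();
  }
}
```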