Bandwidth Sharing Between Cores
Bandwidth is another resource shared between threads. The bandwidth capacity of a system depends on the design of the processor and the memory system, as well as on the memory chips and their location in the system. A consequence of this is that two systems can contain the same processor and the same motherboard yet produce two different measurements for bandwidth. Typically, configuring a system for the best possible performance requires expensive memory chips.
The bandwidth a processor can consume is a function of the number of outstanding memory requests and the rate at which these can be returned. These memory requests can come from hardware or software prefetches, as well as from load or store operations. Since each thread can issue memory requests, the more threads a processor can run, the more bandwidth the processor can consume.
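As an illustration of software prefetching as one source of memory requests, the following sketch issues a prefetch a fixed distance ahead of the loads in a summation loop. It assumes a GCC- or Clang-compatible compiler, which provides the __builtin_prefetch extension; the prefetch distance of 64 elements is an arbitrary assumption, not a tuned value.

```c
#include <stddef.h>

/* Sums an array while prefetching 64 elements ahead of the current load.
   The second argument to __builtin_prefetch (0) requests a read prefetch;
   the third (0) indicates low temporal locality. Each prefetch adds an
   outstanding memory request alongside the demand loads. */
long sum_with_prefetch( const long *data, size_t n )
{
    long total = 0;
    for ( size_t i = 0; i < n; i++ )
    {
        if ( i + 64 < n )
        {
            __builtin_prefetch( &data[i + 64], 0, 0 );
        }
        total += data[i];
    }
    return total;
}
```

The prefetches do not change the result of the computation; they only change when the memory requests are issued.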
Many of the string-handling library routines, such as strlen() or memset(), can be large consumers of memory bandwidth. Since these routines are provided as part of the operating system, they are often optimized to give the best performance for a given system. The code in Listing 9.17 uses multiple threads calling memset() on disjoint regions of memory in order to estimate the available memory bandwidth on a system. Bandwidth can be measured by dividing the amount of memory accessed by the time taken to complete the accesses.
Listing 9.17 Using memset to Measure Memory Bandwidth
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>

#define BLOCKSIZE 1024*1025

int nthreads = 8;
char * memory;

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

void *experiment( void *id )
{
    unsigned int seed = 0;
    int count = 20000;
    for( int i = 0; i < count; i++ )
    {
        memset( &memory[BLOCKSIZE * (int)id], 0, BLOCKSIZE );
    }
    if ( seed == 1 ) { printf( "" ); }
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[64];
    memory = (char*)malloc( 64 * BLOCKSIZE );
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    double start = now();
    for( int i = 0; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, experiment, (void*)i );
    }
    for ( int i = 0; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    double end = now();
    printf( "%i Threads  Time %f s  Bandwidth %f GB/s\n", nthreads,
            ( end - start ),
            ( (double)nthreads * BLOCKSIZE * 20000.0 ) /
            ( end - start ) / 1000000000.0 );
    return 0;
}
The results in Listing 9.18 show the bandwidth measured by the test code for one to eight threads on a system with 64 virtual CPUs. For this particular system, the bandwidth scales nearly linearly with the number of threads until about six threads; beyond that point, the bandwidth decreases. This might seem like a surprising result, but there are several effects that can cause it.
Listing 9.18 Memory Bandwidth Measured on a System with 64 Virtual CPUs
1 Threads  Time  7.082376 s  Bandwidth  2.76 GB/s
2 Threads  Time  7.082576 s  Bandwidth  5.52 GB/s
3 Threads  Time  7.059594 s  Bandwidth  8.31 GB/s
4 Threads  Time  7.181156 s  Bandwidth 10.89 GB/s
5 Threads  Time  7.640440 s  Bandwidth 12.79 GB/s
6 Threads  Time 11.252412 s  Bandwidth 10.42 GB/s
7 Threads  Time 14.723671 s  Bandwidth  9.29 GB/s
8 Threads  Time 17.267288 s  Bandwidth  9.06 GB/s
One possibility is that the threads are interfering with each other on the processor. If multiple threads are sharing a core, the combined set of threads might be fully utilizing the instruction issue capacity of the core. We will discuss the sharing of cores between multiple threads in the section "Pipeline Resource Starvation." A second interaction effect occurs if the threads start interfering in the caches, for example, multiple threads attempting to load data into the same set of cache lines.
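To make the cache interference concrete, the following sketch touches a series of addresses separated by a large power-of-two stride. On many processors, addresses that differ only by such a stride index the same cache set, so only as many of them as the cache's associativity allows can be resident at once, and the accesses repeatedly evict one another. The 64KB stride here is an assumed way size for illustration, not a measured value.

```c
#include <stddef.h>

#define STRIDE  (64 * 1024)   /* assumed way size; power-of-two strides alias */
#define NBLOCKS 64

/* Writes one byte into each of NBLOCKS addresses that are STRIDE bytes
   apart. Because the addresses differ only in high-order bits, they map
   to the same cache set, causing repeated conflict evictions even though
   the total data touched is tiny. The buffer must be at least
   NBLOCKS * STRIDE bytes. */
void touch_conflicting_lines( char *buffer )
{
    for ( int pass = 0; pass < 1000; pass++ )
    {
        for ( int i = 0; i < NBLOCKS; i++ )
        {
            buffer[ (size_t)i * STRIDE ] = (char)pass;
        }
    }
}
```

A contiguous access pattern touching the same number of bytes would fit comfortably in cache; it is the stride, not the footprint, that causes the conflict misses.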
One other effect is the behavior of memory chips when they become saturated. At this point, the chips start experiencing queuing latencies, where the response time for each request increases. Memory chips are arranged in banks, and accessing a particular address leads to a request to a particular bank of memory. Each bank needs a gap between returning two responses, so if multiple threads happen to hit the same bank, the response time becomes governed by the rate at which that bank can return data.
The consequence of all this interaction is that a saturated memory subsystem may end up returning data at less than the peak memory bandwidth. This is clearly seen in the example, where the bandwidth peaks at five threads.
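The falloff can be quantified by comparing each measurement against perfect linear scaling from the single-thread result. The sketch below hardcodes the bandwidth figures from Listing 9.18; the efficiency() helper is a hypothetical name introduced here for illustration.

```c
/* Bandwidth figures (GB/s) copied from Listing 9.18, indexed by thread
   count minus one. */
static const double measured[8] =
    { 2.76, 5.52, 8.31, 10.89, 12.79, 10.42, 9.29, 9.06 };

/* Scaling efficiency: measured bandwidth divided by what perfect linear
   scaling of the single-thread result would predict. */
double efficiency( int nthreads )
{
    return measured[nthreads - 1] / ( nthreads * measured[0] );
}
```

On this data, efficiency stays close to 1.0 through five threads (about 0.93 at five) and then collapses to roughly 0.63 at six threads and 0.41 at eight, matching the saturation behavior described above.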
Listing 9.19 shows memory bandwidth measured on a system with four virtual CPUs. This is a very different scenario: adding a second thread does not increase the memory bandwidth consumed, because the system is already running at peak bandwidth consumption. Adding further threads causes the memory subsystem to deliver reduced bandwidth, for the reasons previously discussed.
Listing 9.19 Memory Bandwidth Measured on a System with Four Virtual CPUs
1 Threads  Time  7.437563 s  Bandwidth 2.63 GB/s
2 Threads  Time 15.238317 s  Bandwidth 2.57 GB/s
3 Threads  Time 24.580981 s  Bandwidth 2.39 GB/s
4 Threads  Time 37.457352 s  Bandwidth 2.09 GB/s