Bandwidth Sharing Between Cores
Bandwidth is another resource shared between threads. The bandwidth capacity of a system depends on the design of the processor and the memory system, as well as on the memory chips and their location in the system. A consequence of this is that two systems can contain the same processor and the same motherboard yet produce two different measurements for bandwidth. Typically, configuring a system for the best possible performance requires expensive memory chips.
The bandwidth a processor can consume is a function of the number of outstanding memory requests and the rate at which these can be returned. These memory requests can come from hardware or software prefetches, as well as from load or store operations. Since each thread can issue memory requests, the more threads a processor can run, the more bandwidth the processor can consume.
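As an illustration of software prefetching as one source of memory requests, the following sketch issues a prefetch a fixed distance ahead of the loads in a summation loop. It assumes a GCC- or Clang-compatible compiler, which provides the __builtin_prefetch extension; the prefetch distance of 64 elements is an arbitrary assumption, not a tuned value.

```c
#include <stddef.h>

/* Sums an array while prefetching 64 elements ahead of the current load.
   The second argument to __builtin_prefetch (0) requests a read prefetch;
   the third (0) indicates low temporal locality. Each prefetch adds an
   outstanding memory request alongside the demand loads. */
long sum_with_prefetch( const long *data, size_t n )
{
    long total = 0;
    for ( size_t i = 0; i < n; i++ )
    {
        if ( i + 64 < n )
        {
            __builtin_prefetch( &data[i + 64], 0, 0 );
        }
        total += data[i];
    }
    return total;
}
```

The prefetches do not change the result of the computation; they only change when the memory requests are issued.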
Many of the string-handling library routines, such as strlen() or memset(), can be large consumers of memory bandwidth. Since these routines are provided as part of the operating system, they are often optimized to give the best performance for a given system. The code in Listing 9.17 uses multiple threads calling memset() on disjoint regions of memory in order to estimate the available memory bandwidth on a system. Bandwidth can be measured by dividing the amount of memory accessed by the time taken to complete the accesses.
Listing 9.17 Using memset to Measure Memory Bandwidth
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>

#define BLOCKSIZE 1024*1025

int nthreads = 8;
char * memory;

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

void *experiment( void *id )
{
    unsigned int seed = 0;
    int count = 20000;
    for( int i = 0; i < count; i++ )
    {
        memset( &memory[BLOCKSIZE * (int)id], 0, BLOCKSIZE );
    }
    if ( seed == 1 ) { printf( "" ); }
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[64];
    memory = (char*)malloc( 64 * BLOCKSIZE );
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    double start = now();
    for( int i = 0; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, experiment, (void*)i );
    }
    for ( int i = 0; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    double end = now();
    printf( "%i Threads  Time %f s  Bandwidth %f GB/s\n", nthreads,
            ( end - start ),
            ( (double)nthreads * BLOCKSIZE * 20000.0 ) /
            ( end - start ) / 1000000000.0 );
    return 0;
}
The results in Listing 9.18 show the bandwidth measured by the test code for one to eight threads on a system with 64 virtual CPUs. For this particular system, the bandwidth scales nearly linearly with the number of threads until about six threads; beyond that point, the bandwidth decreases. This might seem like a surprising result, but there are several effects that can cause it.
Listing 9.18 Memory Bandwidth Measured on a System with 64 Virtual CPUs
1 Threads  Time  7.082376 s  Bandwidth  2.76 GB/s
2 Threads  Time  7.082576 s  Bandwidth  5.52 GB/s
3 Threads  Time  7.059594 s  Bandwidth  8.31 GB/s
4 Threads  Time  7.181156 s  Bandwidth 10.89 GB/s
5 Threads  Time  7.640440 s  Bandwidth 12.79 GB/s
6 Threads  Time 11.252412 s  Bandwidth 10.42 GB/s
7 Threads  Time 14.723671 s  Bandwidth  9.29 GB/s
8 Threads  Time 17.267288 s  Bandwidth  9.06 GB/s
One possibility is that the threads are interfering with each other on the processor. If multiple threads are sharing a core, the combined set of threads might be fully utilizing the instruction issue capacity of the core. We will discuss the sharing of cores between multiple threads in the section "Pipeline Resource Starvation." A second interaction effect occurs if the threads start interfering in the caches, for example, multiple threads attempting to load data into the same set of cache lines.
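To make the cache interference concrete, the following sketch touches a series of addresses separated by a large power-of-two stride. On many processors, addresses that differ only by such a stride index the same cache set, so only as many of them as the cache's associativity allows can be resident at once, and the accesses repeatedly evict one another. The 64KB stride here is an assumed way size for illustration, not a measured value.

```c
#include <stddef.h>

#define STRIDE  (64 * 1024)   /* assumed way size; power-of-two strides alias */
#define NBLOCKS 64

/* Writes one byte into each of NBLOCKS addresses that are STRIDE bytes
   apart. Because the addresses differ only in high-order bits, they map
   to the same cache set, causing repeated conflict evictions even though
   the total data touched is tiny. The buffer must be at least
   NBLOCKS * STRIDE bytes. */
void touch_conflicting_lines( char *buffer )
{
    for ( int pass = 0; pass < 1000; pass++ )
    {
        for ( int i = 0; i < NBLOCKS; i++ )
        {
            buffer[ (size_t)i * STRIDE ] = (char)pass;
        }
    }
}
```

A contiguous access pattern touching the same number of bytes would fit comfortably in cache; it is the stride, not the footprint, that causes the conflict misses.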
One other effect is the behavior of memory chips when they become saturated. At this point, the chips start experiencing queuing latencies, where the response time for each request increases. Memory chips are arranged in banks, and accessing a particular address leads to a request to a particular bank of memory. Each bank needs a gap between returning two responses, so if multiple threads happen to hit the same bank, the response time becomes governed by the rate at which that bank can return data.
The consequence of all this interaction is that a saturated memory subsystem may end up returning data at less than the peak memory bandwidth. This is clearly seen in the example, where the bandwidth peaks at five threads.
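The falloff can be quantified by comparing each measurement against perfect linear scaling from the single-thread result. The sketch below hardcodes the bandwidth figures from Listing 9.18; the efficiency() helper is a hypothetical name introduced here for illustration.

```c
/* Bandwidth figures (GB/s) copied from Listing 9.18, indexed by thread
   count minus one. */
static const double measured[8] =
    { 2.76, 5.52, 8.31, 10.89, 12.79, 10.42, 9.29, 9.06 };

/* Scaling efficiency: measured bandwidth divided by what perfect linear
   scaling of the single-thread result would predict. */
double efficiency( int nthreads )
{
    return measured[nthreads - 1] / ( nthreads * measured[0] );
}
```

On this data, efficiency stays close to 1.0 through five threads (about 0.93 at five) and then collapses to roughly 0.63 at six threads and 0.41 at eight, matching the saturation behavior described above.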
Listing 9.19 shows memory bandwidth measured on a system with four virtual CPUs. This is a very different scenario: adding a second thread does not increase the memory bandwidth consumed, because the system is already running at peak bandwidth consumption. Adding further threads causes the memory subsystem to deliver reduced bandwidth, for the reasons previously discussed.
Listing 9.19 Memory Bandwidth Measured on a System with Four Virtual CPUs
1 Threads  Time  7.437563 s  Bandwidth 2.63 GB/s
2 Threads  Time 15.238317 s  Bandwidth 2.57 GB/s
3 Threads  Time 24.580981 s  Bandwidth 2.39 GB/s
4 Threads  Time 37.457352 s  Bandwidth 2.09 GB/s