False Sharing
False sharing is the situation where multiple threads access separate items of data that happen to reside on a single cache line. Although the threads are all using separate items of data, the cache line itself is shared between them, so only a single thread can write to it at any one time. This is purely a performance issue; there is no correctness problem. It would be a correctness issue if the threads were sharing a single variable on the cache line.
The performance impact comes from the fact that each thread requires the
cache line to be present and writable in the cache of the processor where the
thread is executing. If another thread recently wrote to the cache line, then
the modified data needs to be written back to memory and then sent to the next
processor that wants to write to it.
This can cause accesses to the cache line to take a similar length of time as a miss to memory. In the case of false sharing, the line is constantly being bounced between processors, so most accesses to it end up requiring another processor to write the line back to memory first; the line never gets the benefit of being cache resident.
It is easy to demonstrate the cost of false sharing using the code in Listing 9.20. The code assigns each thread a volatile variable to use as a counter. The fact that the variable is volatile ensures that the code must store and reload it on every iteration. It also ensures that the compiler cannot eliminate the loop, even though the work it performs is redundant. The code creates multiple threads and then times how long the main thread takes to complete the same amount of work as each of the other threads.
Listing 9.20 Example of False Sharing
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

#define COUNT 100000000
volatile int go = 0;
volatile int counters[20];

void *spin( void *id )
{
    int myid = (int)(size_t)id + 1;
    while( !go ) {}
    counters[myid] = 0;
    while ( counters[myid]++ < COUNT ) {}
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[256];
    int nthreads = 1;
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, spin, (void*)(size_t)i );
    }
    double start = now();
    go = 1;
    spin( 0 );
    double end = now();
    printf( "Time %f s\n", end - start );
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    return 0;
}
If we take the code from Listing 9.20 and run it with a single thread, the thread completes its work in about nine seconds on a system
with two dual-core processors. Using four threads on the same system results in
a runtime for the code of about 100 seconds—a slowdown of about 10 times.
It is very easy to solve false sharing by padding
the accessed structures so that the variable used by each thread resides on a
separate cache line. The cache line can then reside in the cache of the processor
where the thread is running, and consequently, all accesses to that cache line
are low cost, and the code runs much faster.
Listing 9.21 shows a modified version of the code
where accesses to the counter structure have been padded so that each counter
is located at 64-byte intervals. This will ensure that the variables are
located on separate cache lines on machines with cache line sizes of 64 bytes
or less.
Listing 9.21 Data Padded to Avoid False Sharing
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

#define COUNT 100000000
volatile int go = 0;
volatile int counters[320];

void *spin( void *id )
{
    int myid = ( (int)(size_t)id + 1 ) * 16; /* 16 ints = 64 bytes apart */
    while( !go ) {}
    counters[myid] = 0;
    while ( counters[myid]++ < COUNT ) {}
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[256];
    int nthreads = 1;
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, spin, (void*)(size_t)i );
    }
    double start = now();
    go = 1;
    spin( 0 );
    double end = now();
    printf( "Time %f s\n", end - start );
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    return 0;
}
The modified code takes about nine seconds to run
with four threads on the same machine.
Although fixing false sharing is easy to do in most
cases, detecting performance loss from it is much harder. In general, false
sharing will turn up as an elevated number of cache misses on a particular
memory operation, and it is hard to distinguish this from the normal cache
misses that occur in all applications. However, the important thing to realize
from the previous description is that when significant time is lost to false
sharing, there will be significant numbers of cache misses, indicating the
points at which the false sharing occurs. Hence, tracking down false sharing
can be as simple as locating the places in the code where there are
unexpectedly high numbers of cache misses on variables that should be local to
one thread.
One important thing to note is that false sharing is a significant issue
for multiprocessor systems but is not nearly as critical for multicore
systems. For example, if we take the code from Listing 9.20 that has the false
sharing issue and run it on a CMT system, it takes 24 seconds with one thread
and 26 seconds with four; the runtime is not significantly changed by the
presence of false sharing. The data is shared through the on-chip caches, and
this sharing has little impact on performance. This is a useful feature of
multicore processors.