False Sharing
False sharing is the situation where multiple threads access separate items of data that happen to reside on a single cache line. Although the threads are all using separate items of data, the cache line itself is shared between them, so only a single thread can write to it at any one time. This is purely a performance issue; there is no correctness problem. It would be a correctness issue if the threads were sharing a single variable on the cache line.
The performance impact comes from the fact that each thread requires the
cache line to be present and writable in the cache of the processor where the
thread is executing. If another thread recently wrote to the cache line, then
the modified data needs to be written back to memory and then sent to the next
processor that wants to write to it.
This can cause accesses to the cache line to take a similar length of time as a miss to memory. In the case of false sharing, the line is constantly being bounced between processors, so most accesses to it end up requiring another processor to write the line back to memory first; the line never gets the benefit of being cache resident.
It is easy to demonstrate the cost of false sharing using the code in Listing 9.20. The code assigns each thread a volatile variable to use as a counter. The fact that the variable is volatile ensures that the code must store and reload it on every iteration. It also ensures that the compiler cannot eliminate the loop, even though the work it performs is redundant. The code creates multiple threads and then times how long the main thread takes to complete the same amount of work as each of the other threads.
Listing 9.20 Example of False Sharing
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

#define COUNT 100000000
volatile int go = 0;
volatile int counters[20];

void *spin( void *id )
{
    int myid = (int)(size_t)id + 1;
    while( !go ) {}
    counters[myid] = 0;
    while ( counters[myid]++ < COUNT ) {}
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[256];
    int nthreads = 1;
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, spin, (void*)(size_t)i );
    }
    double start = now();
    go = 1;
    spin( 0 );
    double end = now();
    printf( "Time %f s\n", end - start );
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    return 0;
}
If we take the code from Listing 9.20 and run it with a single thread, the thread completes its work in about nine seconds on a system
with two dual-core processors. Using four threads on the same system results in
a runtime for the code of about 100 seconds—a slowdown of about 10 times.
It is very easy to solve false sharing by padding
the accessed structures so that the variable used by each thread resides on a
separate cache line. The cache line can then reside in the cache of the processor
where the thread is running, and consequently, all accesses to that cache line
are low cost, and the code runs much faster.
Listing 9.21 shows a modified version of the code
where accesses to the counter structure have been padded so that each counter
is located at 64-byte intervals. This will ensure that the variables are
located on separate cache lines on machines with cache line sizes of 64 bytes
or less.
Listing 9.21 Data Padded to Avoid False Sharing
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

double now()
{
    struct timeval time;
    gettimeofday( &time, 0 );
    return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

#define COUNT 100000000
volatile int go = 0;
volatile int counters[320];

void *spin( void *id )
{
    int myid = ( (int)(size_t)id + 1 ) * 16; /* 16 ints = 64 bytes apart */
    while( !go ) {}
    counters[myid] = 0;
    while ( counters[myid]++ < COUNT ) {}
    return 0;
}

int main( int argc, char* argv[] )
{
    pthread_t threads[256];
    int nthreads = 1;
    if ( argc > 1 ) { nthreads = atoi( argv[1] ); }
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_create( &threads[i], 0, spin, (void*)(size_t)i );
    }
    double start = now();
    go = 1;
    spin( 0 );
    double end = now();
    printf( "Time %f s\n", end - start );
    for( int i = 1; i < nthreads; i++ )
    {
        pthread_join( threads[i], 0 );
    }
    return 0;
}
The modified code takes about nine seconds to run
with four threads on the same machine.
Although fixing false sharing is easy to do in most
cases, detecting performance loss from it is much harder. In general, false
sharing will turn up as an elevated number of cache misses on a particular
memory operation, and it is hard to distinguish this from the normal cache
misses that occur in all applications. However, the important thing to realize
from the previous description is that when significant time is lost to false
sharing, there will be significant numbers of cache misses, indicating the
points at which the false sharing occurs. Hence, tracking down false sharing
can be as simple as locating the places in the code where there are
unexpectedly high numbers of cache misses on variables that should be local to
one thread.
One important thing to note is that false sharing is a significant issue
for multiprocessor systems but is not nearly as critical for multicore
systems. For example, if we take the code from Listing 9.20 that has the false
sharing issue and run it on a CMT system, it takes 24 seconds with one thread
and 26 seconds with four; the runtime is not significantly changed by the
presence of false sharing. The data is shared through the on-chip caches, and
this sharing has little impact on performance. This is a useful feature of
multicore processors.