Reordering of Operations by the Compiler
Although the hardware can reorder operations, the compiler can do the same. The compiler will try to do the "right thing," but there are situations where it needs hints in order to produce code that runs correctly.
It is often thought that the volatile keyword provides a safety net that stops the compiler from optimizing code. Unfortunately, this is not true. The volatile keyword requires the compiler to reload the data from memory each time it is used and to store it back to memory as soon as it is modified. It does not stop the compiler from performing optimizations around the data, nor does it provide any kind of protection against data races.
However, the volatile keyword is necessary to stop the compiler from optimizing away accesses to a variable. Consider the code in Listing 8.8. The variable start is declared as volatile. If it were not, the compiler would be free to read start once and conclude that the loop is either never entered or infinite.
Listing 8.8 Code Where volatile Keyword Is Necessary to Ensure Reloading of Variable start
volatile int start;
void waitforstart()
{
while( start == 0 ) {}
}
A very similar situation exists with the code in Listing 8.9. In this case, the code contains a function call. Function calls generally cause the compiler to reload global data. However, if the variable count is not declared volatile, it is still possible for the compiler to generate an infinite loop. The reason is that if count equals zero and is not a volatile variable, the call to getElementFromList() will never be made, and the variable will remain zero. An optimizing compiler may identify this situation and replace the else branch of the if statement with an infinite loop.
Listing 8.9 Function Calls Do Not Always Enforce Reloading of All Variables
volatile int count;
int* getnextelement()
{
    int element = 0;
    while ( element == 0 )
    {
        if ( count > 0 )
        {
            element = getElementFromList();
        }
    }
    return element;
}
The code shown in Listing 8.10 demonstrates a situation where the compiler can merge two loops, resulting in a change of behavior. The code creates four threads. Each thread runs the same routine, which prints a statement that the thread is ready and then waits on the volatile variable start. When the thread is started, it prints a message, completes a short amount of work, prints a message indicating that it has completed its work, and exits.
The main code creates the threads and, once all the threads are created, signals to them that they can start working. Once this has been done, it waits for all the threads to complete their work before calling pthread_join() on each of them.
Listing 8.10 Code Where the Compiler May Reorder Operations
#include <stdio.h>
#include <pthread.h>

volatile int start[4];
volatile int done[4];

void* work( void* param )
{
    int id = (int)param;
    while ( start[id] == 0 ) {}
    printf( "Thread %i started\n", id );
    double total = 0;
    for ( int i = 0; i < 100000000; i++ ) { total += i; }
    printf( "Thread %i done\n", id );
    done[id] = 1;
    return 0;
}
int main()
{
    pthread_t thread[4];
    for ( int i = 0; i < 4; i++ )
    {
        pthread_create( &thread[i], 0, work, (void*)i );
        done[i] = 0;
    }
    for ( int i = 0; i < 4; i++ )
    {
        start[i] = 1;
    }
    for ( int i = 0; i < 4; i++ )
    {
        while ( done[i] == 0 ) {}
    }
    for ( int i = 0; i < 4; i++ )
    {
        pthread_join( thread[i], 0 );
    }
}
Listing 8.11 shows the output from the code with and without optimization. There is a pronounced difference between the two runs. When compiled without optimization, all the threads proceed at the same time, so the output from the application shows all the threads starting and then all the threads completing. When compiled with optimization, the threads are serialized, so each thread prints the message that it has started, followed immediately by the message that it has completed.
Listing 8.11 Output Without and with Optimization
$ cc loop_merge.c
$ ./a.out
Thread 1 started
Thread 2 started
Thread 0 started
Thread 3 started
Thread 1 done
Thread 0 done
Thread 2 done
Thread 3 done
$ cc -O loop_merge.c
$ ./a.out
Thread 0 started
Thread 0 done
Thread 1 started
Thread 1 done
Thread 2 started
Thread 2 done
Thread 3 started
Thread 3 done
With optimization, the compiler has merged the two loops that set the start variables and read the done variables. This produces code similar to that shown in Listing 8.12. The two loops contain no function calls, and the compiler considers the accesses to the volatile variables to be free of side effects. Hence, the compiler considers it safe to merge the two loops.
Listing 8.12 Compiler-Optimized Code
int main()
{
    pthread_t thread[4];
    for ( int i = 0; i < 4; i++ )
    {
        pthread_create( &thread[i], 0, work, (void*)i );
        done[i] = 0;
    }
    for ( int i = 0; i < 4; i++ )
    {
        start[i] = 1;
        while ( done[i] == 0 ) {}
    }
    for ( int i = 0; i < 4; i++ )
    {
        pthread_join( thread[i], 0 );
    }
}
To correct this situation, we need to modify the code so that the compiler is unable to merge the two loops. The easiest way to do this is to place a function call either in one of the loops or between the two loops. An alternative approach would be to separate the two loops with serial code that cannot be moved because of some dependency or potential aliasing issue. However, both of these approaches add unnecessary instructions to the code, and both run the risk that a "smart" compiler might identify the instructions as unnecessary and remove them.
The gcc asm( "" : : : "memory" ) construct, supported by multiple compilers, can be used to make the compiler order the loops correctly. Listing 8.13 shows the code modified to use this statement. The statement stops the loops from merging yet adds no additional instructions to the code.
Listing 8.13 Using gcc asm() Statement to Cause Correct Operation Ordering
int main()
{
    pthread_t thread[4];
    for ( int i = 0; i < 4; i++ )
    {
        pthread_create( &thread[i], 0, work, (void*)i );
        done[i] = 0;
    }
    for ( int i = 0; i < 4; i++ )
    {
        start[i] = 1;
    }
    asm volatile( "" : : : "memory" );
    for ( int i = 0; i < 4; i++ )
    {
        while ( done[i] == 0 ) {}
    }
    for ( int i = 0; i < 4; i++ )
    {
        pthread_join( thread[i], 0 );
    }
}
Windows provides intrinsics for this purpose. The functions _ReadBarrier(), _WriteBarrier(), and _ReadWriteBarrier() are defined in <intrin.h>. These intrinsics tell the compiler not to reorder the operations. A _ReadBarrier() call ensures that all reads will have completed at that point, and a _WriteBarrier() call ensures that all writes will have completed. These intrinsics only enforce compiler ordering and do not generate any instructions.