Chapter: Multicore Application Programming For Windows, Linux, and Oracle Solaris - Other Parallelization Technologies


Language Extensions

 

Perhaps the most obvious place to add support for parallelism is through extensions to existing languages. Because of its expressive power, C++ is a good vehicle for extensions such as Intel’s Threading Building Blocks. Cilk++, also from Intel, provides a set of language extensions that can be preprocessed to generate the appropriate code. This section explores a number of ways that language extensions can be used to enable the development of parallel applications.

 

Threading Building Blocks

 

Intel’s Threading Building Blocks (TBB) is a C++ library that uses templates to provide common parallelization functionality and supporting objects. For example, the library provides fundamental objects such as containers and synchronization objects. It also provides supporting functionality such as access to timers and memory management routines. The TBB library is open source and has been ported to multiple platforms.

 

Some of the functionality provided by the library can be used as objects in conventional code. The code in Listing 10.2 uses the concurrent queue template to manage communication between a producer thread and a consumer thread.

 

Listing 10.2   Using a Concurrent Queue in a Pthread Producer-Consumer Model

#include "tbb/concurrent_queue.h"
#include <stdio.h>
#include <pthread.h>

using namespace tbb;
using namespace std;

concurrent_queue<int> queue;

extern "C"
{

void * producer( void* )
{
  for( int i=0; i<100; i++ ) { queue.push( i ); }
  return 0;
}

void * consumer( void* )
{
  int value = 0;
  while ( value != 99 )
  {
    queue.pop( value );
    printf( "Value=%i\n", value );
  }
  return 0;
}

}

int main()
{
  pthread_t threads[2];
  pthread_create( &threads[0], 0, producer, 0 );
  pthread_create( &threads[1], 0, consumer, 0 );
  pthread_join( threads[1], 0 );
  pthread_join( threads[0], 0 );
  return 0;
}

 

 

The two POSIX threads in Listing 10.2 use the concurrent queue to pass integer values from the producer to the consumer. Although providing a large set of building-block functionality is very useful, the more important features are those that manage performing parallel work. Using TBB allows the developer to abstract away from POSIX threads and concentrate on the algorithm.

 

The code in Listing 10.3 uses TBB to parallelize the computation of points in the Mandelbrot set. The code declares a Mandle object that contains the method that computes whether a given point is in the set. The operator() is overloaded to perform the computation over a 2D range. The computation is executed by a call to the parallel_for function, which takes as parameters the range over which the computation is to be performed and the object that will receive that range.

 

Listing 10.3   Using TBB to Parallelize Mandelbrot Set Generation

#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"

using namespace tbb;

const int SIZE = 500;

int data[SIZE][SIZE];

class Mandle
{
  int inset( double ix, double iy ) const
  {
    int iterations = 0;
    double x = ix, y = iy, x2 = x*x, y2 = y*y;
    double x3 = 0, y3 = 0;
    while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
    {
      y = 2 * x * y + iy;
      x = x2 - y2 + ix;
      x2 = x * x; x3 = x2;
      y2 = y * y; y3 = y2;
      iterations++;
    }
    return iterations;
  }

public:
  void operator()( const blocked_range2d<int,int> &range ) const
  {
    for( int y = range.rows().begin(); y != range.rows().end(); y++ )
    {
      for( int x = range.cols().begin(); x != range.cols().end(); x++ )
      {
        double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
        double yv = 0.75 * ( (double)(y - SIZE/2) ) / (double)(SIZE/4);
        data[x][y] = inset( xv, yv );
      }
    }
  }

  Mandle() {}
};

int main()
{
  parallel_for( blocked_range2d<int,int>( 0, 500, 10, 0, 500, 10 ), Mandle() );
  return 0;
}

The code does not know what range it will be assigned until runtime. This enables the runtime library to decide how many threads are available and divide the work among the threads appropriately. The runtime will take the original range object and break it down into an appropriate number of smaller ranges. It will be these range objects that are actually passed into the Mandle object.

 

Cilk++

 

The Cilk++ language extensions from Intel provide parallel functionality for applications using a relatively small set of new keywords. Support is currently limited to x86 processors running either Windows or Linux. The language is very close to OpenMP in terms of what it provides. Listing 10.4 shows a Cilk++ version of the Mandelbrot example. To convert the C code to Cilk++ code, it was necessary to replace the main() function with cilk_main(). This change ensures that the Cilk runtime environment is initialized. The other necessary change was to convert the outermost for loop into a cilk_for loop.

 

The program is built using the Cilk++ preprocessor cilkpp, which then invokes the C++ compiler on the resulting code.

 

Listing 10.4   Mandelbrot Example Using cilk_for

#define SIZE 300

int inset( double ix, double iy )
{
  int iterations = 0;
  double x = ix, y = iy, x2 = x*x, y2 = y*y;
  double x3 = 0, y3 = 0;
  while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
  {
    y = 2 * x * y + iy;
    x = x2 - y2 + ix;
    x2 = x * x; x3 = x2;
    y2 = y * y; y3 = y2;
    iterations++;
  }
  return iterations;
}

int cilk_main( int argc, _TCHAR* argv[] )
{
  int data[SIZE][SIZE];
  cilk_for( int y=0; y<SIZE; y++ )
  {
    for( int x=0; x<SIZE; x++ )
    {
      double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
      double yv = 0.75 * ( (double)(y - SIZE/2) ) / (double)(SIZE/4);
      data[x][y] = inset( xv, yv );
    }
  }
  getchar();
  return 0;
}

Cilk++ offers more than just a parallel for construct. A more important feature is cilk_spawn/cilk_sync. This allows the application to spawn worker threads and then wait for them to complete. This is close to the concept of OpenMP tasks.

 

We can take the OpenMP quicksort example from Listing 9.15 in the previous chapter and convert it into the equivalent Cilk++ code shown in Listing 10.5. The cilk_spawn statement causes the call that follows it to be executed, potentially, by another thread. Every function ends with an implicit cilk_sync statement, so a function can spawn multiple threads but will not complete until those threads have also completed.

 

Listing 10.5   Quicksort Written in Cilk++

#include <stdio.h>
#include <stdlib.h>

void setup( int * array, int len )
{
  for( int i=0; i<len; i++ ) { array[i] = ( i*7 - 3224 ) ^ 20435; }
}

void output( int * array, int len )
{
  for( int i=0; i<len; i++ ) { printf( "%8i", array[i] ); }
}

void quick_sort_range( int * array, int lower, int upper )
{
  int tmp;
  int mid = ( upper + lower ) / 2;
  int pivot = array[mid];
  int tlower = lower, tupper = upper;
  while ( tlower <= tupper )
  {
    while ( array[tlower] < pivot ) { tlower++; }
    while ( array[tupper] > pivot ) { tupper--; }
    if ( tlower <= tupper )
    {
      tmp = array[tlower];
      array[tlower] = array[tupper];
      array[tupper] = tmp;
      tupper--;
      tlower++;
    }
  }
  if ( lower < tupper )
  {
    cilk_spawn quick_sort_range( array, lower, tupper );
  }
  if ( tlower < upper )
  {
    cilk_spawn quick_sort_range( array, tlower, upper );
  }
}

void quick_sort( int *array, int elements )
{
  quick_sort_range( array, 0, elements );
}

int cilk_main()
{
  int size = 10*1024*1024;
  int * array = (int*)malloc( sizeof(int)*size );
  setup( array, size );
  quick_sort( array, size-1 );
  output( array, size );
  return 0;
}

 

The Cilk++ environment also provides two tools that are useful for developers of parallel applications: a data race detection tool called cilkscreen and a performance prediction tool called cilkview. The data race tool cilkscreen provides functionality that is common to most similar tools. The performance tool cilkview provides something different: a high-level report on the scalability of the application. This is of less use than a line-by-line profile but does provide an interesting insight into how the application may scale as the number of threads increases. The output in Listing 10.6 shows cilkview run on the quicksort code (modified so as not to print out the list of sorted values).

 

Listing 10.6   Scalability Estimates from cilkview

$ cilkview cilkquicksort
cilkview: generating scalability data
Whole Program Statistics:

Cilkview Scalability Analyzer V1.1.0, Build 8504

1) Parallelism Profile
   Work :                                  5,216,903,917 instructions
   Span :                                  338,025,076 instructions
   Burdened span :                         339,522,925 instructions
   Parallelism :                           15.43
   Burdened parallelism :                  15.37
   Number of spawns/syncs :                9,106,354
   Average instructions / strand :         190
   Strands along span :                    119
   Average instructions / strand on span : 2,840,546
   Total number of atomic instructions :   9,106,371
   Frame count :                           18,212,713

2) Speedup Estimate
   2 processors:   1.80 - 2.00
   4 processors:   3.00 - 4.00
   8 processors:   4.51 - 8.00
   16 processors:  6.02 - 15.43
   32 processors:  7.22 - 15.43

The tool reports a number of statistics about the run of the program. The more interesting result is the section showing estimated speedup for various numbers of threads. The tool reports that the best-case estimate for scaling is just over 15 times faster at 16 threads. The lower-bound estimate is that the code will never get much more than eight times faster, regardless of the number of threads used.

 

Grand Central Dispatch

Grand Central Dispatch (GCD) from Apple is an approach to task-based parallelism. It is very similar in concept to an OpenMP task or a cilk_spawn call. The core functionality of GCD is that a block of work can be placed onto a dispatch queue. When hardware resources are available, the work is removed from the queue and completed. There are various ways that work can be dispatched to a queue. For example, the dispatch_async() function dispatches work to a queue and immediately returns to the calling code, having placed the work on the queue for later completion. In contrast, the dispatch_apply() function places work on a queue but does not return until all the work is completed. Listing 10.7 shows an example of using the dispatch_apply() function to parallelize the calculation of the Mandelbrot set.

 

Listing 10.7   Mandelbrot Example Using dispatch_apply

#include <dispatch/dispatch.h>
#include <stdio.h>

#define SIZE 300

int inset( double ix, double iy )
{
  int iterations = 0;
  double x = ix, y = iy, x2 = x*x, y2 = y*y;
  double x3 = 0, y3 = 0;
  while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
  {
    y = 2 * x * y + iy;
    x = x2 - y2 + ix;
    x2 = x * x; x3 = x2;
    y2 = y * y; y3 = y2;
    iterations++;
  }
  return iterations;
}

int main( int argc, char* argv[] )
{
  int data[SIZE][SIZE];
  dispatch_apply( SIZE, dispatch_get_global_queue( 0, 0 ),
    ^( size_t y )
    {
      for( int x=0; x<SIZE; x++ )
      {
        double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
        double yv = 0.75 * ( (double)((int)y - SIZE/2) ) / (double)(SIZE/4);
        data[x][y] = inset( xv, yv );
      }
    }
  );
  getchar();
  return 0;
}

 

The dispatch_apply() call takes three parameters. The first parameter is the number of iterations. The second parameter is the queue to which the work is to be dispatched. An application may create and manage multiple queues; the global queue used in the example represents all available processors. The final parameter is a block variable. This contains a block literal holding the code that each iteration of the loop needs to execute. The block literal is denoted with a caret (^). This block of code is actually passed as a parameter into the routine; hence, the closing bracket of the function call comes after the closing brace of the block of code.

 

Features Proposed for the Next C and C++ Standards

 

The next revisions of both the C and C++ standards contain some features that will help in the development of multithreaded codes. In previous chapters, we looked at the differences between POSIX threads and Windows threads. The functionality provided by the two implementations is very similar, but there are many differences in how that functionality is exposed to the developer. Incorporating some of this support into the language standards will be a huge improvement in the ability to write portable parallel code.

 

At the time of writing, it is not certain whether these features will be accepted or modified in the final version. Consequently, it is not possible to write compilable code that demonstrates them, both because the features may change and because no compilers yet implement them. However, there seem to be three areas that the languages are standardizing.

 

The first area is that of creating and managing threads. In C, there are likely to be a set of function calls reminiscent of the POSIX calls for creating and managing threads. The example in Listing 10.8 shows the proposed approach for creating and running a new thread. The resulting code looks very similar to both the Windows and POSIX code.

 

Listing 10.8   Creating a New Thread in the Proposed Next C Standard

#include <threads.h>
#include <stdio.h>

int work( void * arg )
{
  printf( "Child thread\n" );
  return 0;
}

int main()
{
  thrd_t thread;
  thrd_create( &thread, work, 0 );
  thrd_join( thread, 0 );
}

In C++ it is natural for threads to be objects. The code in Listing 10.9 is equivalent code that creates and manages a child thread. The work to be performed is passed as a function to the created thread object.

 

Listing 10.9   Creating a New Thread in the Proposed Next C++ Standard

#include <thread>
#include <iostream>

void work()
{
  std::cout << "Child thread\n";
}

int main()
{
  std::thread mythread( work );
  mythread.join();
}

Both standards also propose support for atomic operations and memory ordering. In C, atomic operations are supported through providing atomic types and operations that can be performed on those types. The code in Listing 10.10 demonstrates how an atomic variable can be incremented.

 

Listing 10.10 Atomic Increment in C Proposal

#include <stdatomic.h>
#include <stdio.h>

void count()
{
  static atomic_int value = 0;
  atomic_fetch_add( &value, 1 );
  printf( "Counter=%i\n", atomic_load( &value ) );
}

C++ defines the atomic types as structs and provides methods for accessing them. Listing 10.11 shows equivalent C++ code.

 

Listing 10.11 Atomic Increment in C++ Proposal

#include <atomic>
#include <iostream>

void count()
{
  static std::atomic<int> value( 0 );
  value += 1;
  std::cout << "Counter=" << value << "\n";
}

 

The final area of standardization covers mutexes and condition variables. The code in Listing 10.12 demonstrates how, with the proposed C standard, a mutex could be created and used to provide atomic access to a shared variable. The call to mtx_init() takes a parameter describing the type of mutex required, which allows the mutex to have different runtime behaviors, such as supporting a timeout.

 

Listing 10.12 Protecting a Shared Variable with a Mutex in C

#include <threads.h>
#include <stdio.h>

int counter = 0;

mtx_t mutex;

int work( void * arg )
{
  mtx_lock( &mutex );
  counter++;
  printf( "Counter = %i\n", counter );
  mtx_unlock( &mutex );
  return 0;
}

int main()
{
  thrd_t thread;
  mtx_init( &mutex, mtx_plain );
  thrd_create( &thread, work, 0 );
  thrd_join( thread, 0 );
  mtx_destroy( &mutex );
}

 

The C++ proposed standard implements a class hierarchy of different kinds of mutex. Listing 10.13 shows the equivalent code to protect a shared variable.

 

Listing 10.13 Protecting a Shared Variable with a Mutex Object in C++

#include <thread>
#include <mutex>
#include <iostream>

int counter = 0;

std::mutex mymutex;

struct work
{
  void operator()()
  {
    mymutex.lock();
    std::cout << "Counter " << counter << "\n";
    mymutex.unlock();
  }
};

int main()
{
  std::thread mythread{ work() };
  mythread.join();
}

Microsoft's C++/CLI

 

Visual Studio supports a C++/Common Language Infrastructure (CLI), which is a C++-derived language that runs on the .NET Common Language Runtime (CLR). As with all managed languages, it compiles to bytecode, which is then run on a virtual machine. The language provides a set of objects for synchronization and threads.

 

The code in Listing 10.14 illustrates how a main thread can create a child thread. The main thread prints one set of output, and the child thread prints another. The main thread waits for a key press before exiting.

 

Listing 10.14 Creating a Child Thread in C++/CLI

using namespace System;
using namespace System::Threading;

void work()
{
  Console::WriteLine( L"Work Thread" );
}

int main( array< System::String ^ > ^args )
{
  Thread^ mythread = gcnew Thread( gcnew ThreadStart( &work ) );
  mythread->Start();
  Console::WriteLine( L"Main Thread" );
  Console::ReadKey();
  return 0;
}

 

The Thread object is the abstraction for a thread. To produce a running thread, an object needs to be created and then started by calling its Start() method. The gcnew operator creates a managed object on the garbage-collected heap and returns a handle to it. A variable that holds a handle to a particular type is indicated with a caret (^) character. The thread requires the address of a routine to execute. This is provided through a ThreadStart object, which is created with the address of the desired routine.

 

As previously mentioned, the language provides the familiar synchronization primitives. For example, we can use a mutex to ensure that the main thread prints its output before the child thread.

 

The code then becomes more complex because it is not possible to use managed global variables. One way around this is to convert the code that the child thread executes into a method in an object. Listing 10.15 defines a ThreadWork object that contains the handle of a Mutex object; this is set when the object is created.

 

Listing 10.15 Using a Mutex to Enforce Ordering

using namespace System;
using namespace System::Threading;

public ref class ThreadWork
{
private:
  Mutex^ mymutex;

public:
  ThreadWork( Mutex^ m )
  {
    mymutex = m;
  }

  void work()
  {
    mymutex->WaitOne();
    Console::WriteLine( L"Child thread" );
    mymutex->ReleaseMutex();
  }
};

int main( array<System::String ^> ^args )
{
  Mutex^ mymutex = gcnew Mutex();
  ThreadWork^ T = gcnew ThreadWork( mymutex );
  mymutex->WaitOne();
  Thread^ mythread = gcnew Thread( gcnew ThreadStart( T, &ThreadWork::work ) );
  mythread->Start();
  Console::WriteLine( L"Main Thread" );
  mymutex->ReleaseMutex();
  mythread->Join();
  Console::ReadKey();
  return 0;
}

 

The Mutex object is acquired through a call to the method WaitOne() and released through a call to the method ReleaseMutex().

 

The first thing the code does is to create the Mutex object, and the handle of this is passed into the ThreadWork object. The main thread acquires the mutex by calling WaitOne(). This will stop the child thread from executing until after the main thread has printed its output.

 

The ThreadStart object now needs to be initialized with two parameters. The first parameter is the instantiation, T, of a ThreadWork object, and the second parameter is the method to be run by the thread.

 

Once the main thread has printed its output to the console, it releases the mutex and then calls the Join() method of the Thread object to wait for the child thread to complete.

 

