Language Extensions
Perhaps the most obvious place to add support for parallelism is through
extensions to existing languages. C++, because of its expressive power, is a
good vehicle for extensions such as Intel's Threading Building Blocks. Cilk++,
also from Intel, provides a set of language extensions that can be preprocessed
to generate the appropriate code. This section explores a number of ways that
language extensions can enable the development of parallel applications.
Threading Building Blocks
Intel's Threading Building Blocks (TBB) is a C++ library that uses
templates to provide common parallelization functionality and supporting
objects. For example, the library provides fundamental objects such as
containers and synchronization objects. It also provides supporting
functionality such as access to timers and memory management routines. The TBB
library is open source and ported to multiple platforms.
Some of the functionality provided by the library could be used as
objects in conventional code. The code in Listing 10.2 uses the concurrent
queue template to manage communication between a producer and consumer thread.
Listing 10.2 Using a Concurrent Queue in a Pthread Producer-Consumer Model
#include "tbb/concurrent_queue.h"
#include <stdio.h>
#include <pthread.h>

using namespace tbb;
using namespace std;

concurrent_queue<int> queue;

extern "C"
{
void * producer( void* )
{
    for( int i=0; i<100; i++ ) { queue.push( i ); }
    return 0;
}

void * consumer( void* )
{
    int value = 0;
    while ( value != 99 )
    {
        queue.pop( value );
        printf( "Value=%i\n", value );
    }
    return 0;
}
}

int main()
{
    pthread_t threads[2];
    pthread_create( &threads[0], 0, producer, 0 );
    pthread_create( &threads[1], 0, consumer, 0 );
    pthread_join( threads[1], 0 );
    pthread_join( threads[0], 0 );
    return 0;
}
The two POSIX threads in Listing 10.2 use the concurrent queue to
pass integer values from the producer to the consumer. Although providing a
large set of building-block functionality is very useful, the more important
features are those that manage performing parallel work. Using TBB allows
the developer to abstract away from POSIX threads and concentrate on the
algorithm.
The code in Listing 10.3 uses TBB to parallelize the computation of points in
the Mandelbrot set. The code declares a Mandle object that contains the methods
that compute whether a given point is in the set. The operator() is overloaded
to perform the computations over a 2D range. The computation is executed by a
call to the parallel_for function, which takes as parameters the range over
which the computation is to be performed and the object that receives that
range.
Listing 10.3 Using TBB to Parallelize Mandelbrot Set Generation
#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"

using namespace tbb;

const int SIZE = 500;
int data[SIZE][SIZE];

class Mandle
{
    int inset( double ix, double iy ) const
    {
        int iterations = 0;
        double x = ix, y = iy, x2 = x*x, y2 = y*y;
        double x3 = 0, y3 = 0;
        while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
        {
            y = 2 * x * y + iy;
            x = x2 - y2 + ix;
            x2 = x * x;
            x3 = x2;
            y2 = y * y;
            y3 = y2;
            iterations++;
        }
        return iterations;
    }
public:
    void operator()( const blocked_range2d<int,int> &range ) const
    {
        for( int y = range.rows().begin(); y != range.rows().end(); y++ )
        {
            for( int x = range.cols().begin(); x != range.cols().end(); x++ )
            {
                double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
                double yv = 0.75 * ( (double)(y - SIZE/2) ) / (double)(SIZE/4);
                data[x][y] = inset( xv, yv );
            }
        }
    }
    Mandle() {}
};

int main()
{
    parallel_for( blocked_range2d<int,int>(0,500,10,0,500,10), Mandle() );
    return 0;
}
The code does not know what range it will be assigned until runtime. This
enables the runtime library to decide how many threads are available and divide
the work among the threads appropriately. The runtime takes the original range
object and breaks it down into an appropriate number of smaller ranges; it is
these smaller range objects that are actually passed into the Mandle object.
The Cilk++ language extensions from Intel provide parallel functionality
for applications using a relatively small set of new keywords. Support is
currently limited to x86 processors running either Windows or Linux. The
language is very close to OpenMP in terms of what it provides. Listing 10.4
shows a Cilk++ version of the Mandelbrot example. To convert the C language
code to Cilk++ code, it was necessary to replace the main() function with
cilk_main(). This change ensures that the Cilk runtime environment is
initialized. The other necessary change was to convert the outermost for loop
into a cilk_for loop. The program is built using the Cilk++ preprocessor
cilkpp, which then invokes the C++ compiler on the resulting code.
Listing 10.4 Mandelbrot Example Using cilk_for
#define SIZE 300

int inset( double ix, double iy )
{
    int iterations = 0;
    double x = ix, y = iy, x2 = x*x, y2 = y*y;
    double x3 = 0, y3 = 0;
    while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
    {
        y = 2 * x * y + iy;
        x = x2 - y2 + ix;
        x2 = x * x;
        x3 = x2;
        y2 = y * y;
        y3 = y2;
        iterations++;
    }
    return iterations;
}

int cilk_main( int argc, _TCHAR* argv[] )
{
    int data[SIZE][SIZE];
    cilk_for( int y=0; y<SIZE; y++ )
    {
        for( int x=0; x<SIZE; x++ )
        {
            double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
            double yv = 0.75 * ( (double)(y - SIZE/2) ) / (double)(SIZE/4);
            data[x][y] = inset( xv, yv );
        }
    }
    getchar();
    return 0;
}
Cilk++ offers more than just a parallel for construct. A more important
feature is cilk_spawn/cilk_sync, which allows the application to spawn worker
threads and then wait for them to complete. This is close to the concept of
OpenMP tasks.

We can take the OpenMP quicksort example from Listing 9.15 in the previous
chapter and convert it into the equivalent Cilk++ code shown in Listing 10.5.
The cilk_spawn statement causes the following code to be, potentially, executed
by another thread. Every function ends with an implicit cilk_sync statement, so
a function can spawn multiple threads but will not complete until those threads
have also completed.
Listing 10.5 Quicksort Written in Cilk++
#include <stdio.h>
#include <stdlib.h>

void setup( int * array, int len )
{
    for( int i=0; i<len; i++ ) { array[i] = ( i*7 - 3224 ) ^ 20435; }
}

void output( int * array, int len )
{
    for( int i=0; i<len; i++ ) { printf( "%8i", array[i] ); }
}

void quick_sort_range( int * array, int lower, int upper )
{
    int tmp;
    int mid = ( upper + lower ) / 2;
    int pivot = array[mid];
    int tlower = lower, tupper = upper;
    while ( tlower <= tupper )
    {
        while ( array[tlower] < pivot ) { tlower++; }
        while ( array[tupper] > pivot ) { tupper--; }
        if ( tlower <= tupper )
        {
            tmp = array[tlower];
            array[tlower] = array[tupper];
            array[tupper] = tmp;
            tupper--;
            tlower++;
        }
    }
    if ( lower < tupper )
    {
        cilk_spawn quick_sort_range( array, lower, tupper );
    }
    if ( tlower < upper )
    {
        cilk_spawn quick_sort_range( array, tlower, upper );
    }
}

void quick_sort( int *array, int elements )
{
    quick_sort_range( array, 0, elements );
}

int cilk_main()
{
    int size = 10*1024*1024;
    int * array = (int*)malloc( sizeof(int)*size );
    setup( array, size );
    quick_sort( array, size-1 );
    output( array, size );
    return 0;
}
The Cilk++ environment also provides two tools that are useful for
developers of parallel applications: a data race detection tool called
cilkscreen and a performance prediction tool called cilkview. The data race
tool cilkscreen provides functionality that is common to most similar tools.
The performance tool cilkview provides something different: a high-level report
on the scalability of the application. This is of less use than a line-by-line
profile but does provide an interesting insight into how the application may
scale as the number of threads increases. The output in Listing 10.6 shows
cilkview run on the quicksort code (modified so as not to print out the list of
sorted values).
Listing 10.6 Scalability Estimates from cilkview
$ cilkview cilkquicksort
cilkview: generating scalability data
Whole Program Statistics:
Cilkview Scalability Analyzer V1.1.0, Build 8504

1) Parallelism Profile
   Work :                                  5,216,903,917 instructions
   Span :                                  338,025,076 instructions
   Burdened span :                         339,522,925 instructions
   Parallelism :                           15.43
   Burdened parallelism :                  15.37
   Number of spawns/syncs :                9,106,354
   Average instructions / strand :         190
   Strands along span :                    119
   Average instructions / strand on span : 2,840,546
   Total number of atomic instructions :   9,106,371
   Frame count :                           18,212,713

2) Speedup Estimate
   2 processors:  1.80 - 2.00
   4 processors:  3.00 - 4.00
   8 processors:  4.51 - 8.00
   16 processors: 6.02 - 15.43
   32 processors: 7.22 - 15.43
The tool reports a number of statistics about the run of the program.
The more interesting result is the section showing estimates of speedup for
various numbers of threads. The tool reports that the best-case estimate for
scaling is just over 15 times faster at 16 threads. A lower bound for scaling
is that the code will never get much more than eight times faster, regardless
of the number of threads used.
Grand Central Dispatch
Grand Central Dispatch (GCD) from Apple is an approach to task-based
parallelism. It is very similar to the concept of an OpenMP task or a
cilk_spawn call. The core functionality of GCD is that a block of work can be
placed onto a dispatch queue. When hardware resources become available, the
work is removed from the queue and completed. There are various ways that work
can be dispatched to a queue. For example, the dispatch_async() function
dispatches work to a queue and immediately returns to the calling code, having
placed the work on the queue for later completion. In contrast, the
dispatch_apply() function places work on a queue but does not return until all
the work is completed. Listing 10.7 shows an example of using the
dispatch_apply() function to parallelize the calculation of the Mandelbrot set.
Listing 10.7 Mandelbrot Example Using dispatch_apply
#include <dispatch/dispatch.h>
#include <stdio.h>

#define SIZE 300

int data[SIZE][SIZE];   /* file scope, so the block can write to it */

int inset( double ix, double iy )
{
    int iterations = 0;
    double x = ix, y = iy, x2 = x*x, y2 = y*y;
    double x3 = 0, y3 = 0;
    while ( ( x3 + y3 < 4 ) && ( x2 + y2 < 4 ) && ( iterations < 1000 ) )
    {
        y = 2 * x * y + iy;
        x = x2 - y2 + ix;
        x2 = x * x;
        x3 = x2;
        y2 = y * y;
        y3 = y2;
        iterations++;
    }
    return iterations;
}

int main( int argc, char* argv[] )
{
    dispatch_apply( SIZE, dispatch_get_global_queue( 0, 0 ),
        ^( size_t y )
        {
            for( int x=0; x<SIZE; x++ )
            {
                double xv = 0.75 * ( (double)(x - SIZE/2) ) / (double)(SIZE/4);
                double yv = 0.75 * ( (double)((int)y - SIZE/2) ) / (double)(SIZE/4);
                data[x][y] = inset( xv, yv );
            }
        }
    );
    getchar();
    return 0;
}
The dispatch_apply() call takes three parameters. The first parameter is the
number of iterations. The second parameter is the queue to which the work is to
be dispatched. An application may create and manage multiple queues; the global
queue used in the example is the queue representing all available processors.
The final parameter is a block variable. This contains a block literal holding
the code that each iteration of the loop needs to execute. The block literal is
denoted with a caret (^). This block of code is actually passed as a parameter
into the routine; hence, the closing bracket of the function call comes after
the closing bracket of the block of code.
Features Proposed for the Next C and C++ Standards
The next revisions of both the C and C++ standards contain some features
that will help in the development of multithreaded codes. In previous chapters,
we looked at the differences between POSIX threads and Windows threads. The
functionality provided by the two implementations is very similar, but there
are many differences in how that functionality is exported to the developer.
Incorporating some of this support into the language standards will be a huge
improvement in the ability to write portable parallel code.
At the time of writing, it is not certain whether these features will be
accepted or modified in the final versions. Consequently, it is not possible to
write compilable code that demonstrates them, both because the features may
change and because no compilers yet implement them. However, there seem to be
three areas in which the languages are standardizing.
The first area is the creation and management of threads. In C, there are
likely to be a set of function calls reminiscent of the POSIX calls for
creating and managing threads. The example in Listing 10.8 shows the proposed
approach for creating and running a new thread. The resulting code looks very
similar to both the Windows and POSIX code.
Listing 10.8 Creating a New Thread in the Proposed Next C Standard
#include <threads.h>
#include <stdio.h>

int work( void * param )
{
    printf( "Child thread\n" );
    return 0;
}

int main()
{
    thrd_t thread;
    thrd_create( &thread, work, 0 );
    thrd_join( thread, 0 );
}
In C++ it is natural for threads to be objects. The code in Listing 10.9 is
equivalent code that creates and manages a child thread. The work to be
performed is passed as a function to the created thread object.
Listing 10.9 Creating a New Thread in the Proposed Next C++ Standard
#include <thread>
#include <iostream>

void work()
{
    std::cout << "Child thread\n";
}

int main()
{
    std::thread mythread( work );
    mythread.join();
}
Both standards also propose support for atomic operations and memory
ordering. In C, atomic operations are supported by providing atomic types and
the operations that can be performed on those types. The code in Listing 10.10
demonstrates how an atomic variable can be incremented.
Listing 10.10 Atomic Increment in C Proposal
#include <stdatomic.h>
#include <stdio.h>

void count()
{
    static atomic_int value = 0;
    atomic_fetch_add( &value, 1 );
    printf( "Counter=%i\n", (int)value );
}
C++ defines the atomic types as structs and
provides methods for accessing them. Listing 10.11 shows equivalent C++ code.
Listing 10.11 Atomic Increment in C++ Proposal
#include <atomic>
#include <cstdio>

void count()
{
    static std::atomic<int> value(0);
    value += 1;
    printf( "Counter=%i\n", (int)value );
}
The final area of standardization covers mutexes and condition variables.
The code in Listing 10.12 demonstrates how, with the proposed C standard, a
mutex could be created and used to provide atomic access to a shared variable.
The call to mtx_init() takes a parameter describing the type of mutex that is
required, which provides the facility for the mutex to have different runtime
behaviors, such as having a timeout.
Listing 10.12 Protecting a Shared Variable with a Mutex in C
#include <threads.h>
#include <stdio.h>

int counter = 0;
mtx_t mutex;

int work( void * param )
{
    mtx_lock( &mutex );
    counter++;
    printf( "Counter = %i\n", counter );
    mtx_unlock( &mutex );
    return 0;
}

int main()
{
    thrd_t thread;
    mtx_init( &mutex, mtx_plain );
    thrd_create( &thread, work, 0 );
    thrd_join( thread, 0 );
    mtx_destroy( &mutex );
}
The C++ proposed standard implements a class
hierarchy of different kinds of mutex. Listing 10.13 shows the equivalent code
to protect a shared variable.
Listing 10.13 Protecting a Shared Variable with a Mutex Object in C++
#include <thread>
#include <mutex>
#include <iostream>

int counter = 0;
std::mutex mymutex;

struct work
{
    void operator()()
    {
        mymutex.lock();
        std::cout << "Counter " << counter << "\n";
        mymutex.unlock();
    }
};

int main()
{
    std::thread mythread{ work() };
    mythread.join();
}
Microsoft's C++/CLI
Visual Studio supports C++/CLI (C++ for the Common Language
Infrastructure), a C++-derived language that runs on the .NET Common Language
Runtime (CLR). As with all managed languages, it compiles to bytecode, which is
then run on a virtual machine. The language provides a set of objects for
synchronization and threads.
The code in Listing 10.14 illustrates how a main
thread can create a child thread. The main thread prints one set of output, and
the child thread prints another. The main thread waits for a key press before
exiting.
Listing 10.14 Creating a Child Thread in C++/CLI
using namespace System;
using namespace System::Threading;

void work()
{
    Console::WriteLine( L"Work Thread" );
}

int main( array< System::String ^ > ^args )
{
    Thread^ mythread = gcnew Thread( gcnew ThreadStart( &work ) );
    mythread->Start();
    Console::WriteLine( L"Main Thread" );
    Console::ReadKey();
    return 0;
}
The Thread object is the abstraction for a thread. To produce a running
thread, an object needs to be created and then started by calling its Start()
method. The gcnew operator creates a managed object on the garbage-collected
heap and returns a handle to it. A variable that holds a handle to a particular
type is indicated with a caret (^) character. The thread requires the address
of a routine to execute. This is provided through a ThreadStart object, which
is created with the address of the desired routine.
As previously mentioned, the language provides the familiar synchronization
primitives. For example, we can use a mutex to ensure that the main thread
prints its output before the child thread.

The code becomes more complex because it is not possible to use managed
global variables. One way around this is to convert the code that the child
thread executes into a method of an object. Listing 10.15 defines a ThreadWork
object that contains the handle of a Mutex object; this is set when the object
is created.
Listing 10.15 Using a Mutex to Enforce Ordering
using namespace System;
using namespace System::Threading;

public ref class ThreadWork
{
private:
    Mutex^ mymutex;
public:
    ThreadWork( Mutex^ m )
    {
        mymutex = m;
    }
    void work()
    {
        mymutex->WaitOne();
        Console::WriteLine( L"Child thread" );
        mymutex->ReleaseMutex();
    }
};

int main( array<System::String ^> ^args )
{
    Mutex^ mymutex = gcnew Mutex();
    ThreadWork^ T = gcnew ThreadWork( mymutex );
    mymutex->WaitOne();
    Thread^ mythread = gcnew Thread( gcnew ThreadStart( T, &ThreadWork::work ) );
    mythread->Start();
    Console::WriteLine( L"Main Thread" );
    mymutex->ReleaseMutex();
    mythread->Join();
    Console::ReadKey();
    return 0;
}
The Mutex object is acquired through a call to the method WaitOne() and
released through a call to the method ReleaseMutex(). The first thing the code
does is create the Mutex object; its handle is passed into the ThreadWork
object. The main thread acquires the mutex by calling WaitOne(), which stops
the child thread from executing until after the main thread has printed its
output.

The ThreadStart object now needs to be initialized with two parameters. The
first parameter is the instantiation, T, of a ThreadWork object, and the second
parameter is the method to be run by the thread.

Once the main thread has printed its output to the console, it releases the
mutex and then calls the Join() method of the Thread object to wait for the
child thread to complete.