Chapter: Multicore Application Programming For Windows, Linux, and Oracle Solaris : Other Parallelization Technologies

GPU-Based Computing


An approach to parallelism that has recently evolved is the use of graphics co-processors as accelerators for computation. This came about because as the requirements for fast and detailed graphical representations evolved, the hardware to implement them began to increasingly resemble hardware that could perform fast parallel floating-point or integer computation. Recently, there have been multiple efforts to export the ability to perform computation on graphics processing units (GPUs) to common programming languages. The most well-known of these are Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). CUDA is specific to Nvidia, whereas OpenCL is supported on GPUs from both Nvidia and ATI.

Although this approach utilizes many cores to perform computations in parallel, the details of the approach are quite different from all the other approaches discussed in this book. The most important consideration is that GPUs represent compute co-processors, and there are several constraints with co-processors.

The first is that they need not, and in the case of GPUs do not, share the instruction set of the host processor. This means that producing code that runs on the GPU requires a more complex tool chain. This has to produce code for the host processor, together with code for the GPU, and then bind the two different sets of code together into a single executable.

The second problem is that GPUs do not share the same address space as the host processor, so data needs to be copied across to the GPU. The copy is time-consuming, which means the problem being tackled must be large enough to justify its cost.

These problems can be, if not hidden, at least reduced by the language used to program the GPU. Listing 10.1 shows a simple program written using CUDA. An equivalent program written in OpenCL would look broadly similar.

Listing 10.1 Simple CUDA Program

#include <stdlib.h>
#include "cuda.h"

#define LEN 100000

// GPU code: each thread squares one element of the array
__global__ void square( float *data, int length )
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if ( index < length )   // threads past the end of the array do nothing
  {
    data[index] = data[index] * data[index];
  }
}

// Host code
int main()
{
  float *host_data, *gpu_data;
  int ThreadsPerBlock, Blocks;

  // Allocate memory on the host and on the GPU
  host_data = (float*)malloc( LEN * sizeof(float) );
  cudaMalloc( &gpu_data, LEN * sizeof(float) );

  // Initialize data on host
  for( int i=0; i<LEN; i++ )
  {
    host_data[i] = 2*i;
  }

  // Copy host data to GPU
  cudaMemcpy( gpu_data, host_data, LEN * sizeof(float), cudaMemcpyHostToDevice );

  // Perform computation on GPU
  ThreadsPerBlock = 128;
  Blocks = (int)( (LEN-1) / ThreadsPerBlock ) + 1;
  square <<<Blocks, ThreadsPerBlock>>>( gpu_data, LEN );

  // Copy GPU data back to host (destination first, then source)
  cudaMemcpy( host_data, gpu_data, LEN * sizeof(float), cudaMemcpyDeviceToHost );

  // Free allocated memory
  cudaFree( gpu_data );
  free( host_data );
}

The code is in two routines. The routine square() contains the code that is actually executed by the GPU. The routine main() is executed by the host processor.

It is best to describe the routine main() first because this code performs the preparation for the parallel work. This routine needs to allocate memory both on the host system, using malloc(), and on the GPU, using the call cudaMalloc(). The method of passing data between the host system and the GPU is to copy the data from the host system to the GPU. Hence, it is necessary to reserve space for the data in both places.

The host_data array is initialized and then copied over to the GPU by calling cudaMemcpy(). The cudaMemcpy() call is used for transferring data in both directions, and the direction of the copy is determined by the last parameter passed into the function. Copying data to and from the GPU requires the data to be sent across the bus that connects the processor to the GPU. This bus provides relatively low bandwidth, perhaps 8GB/s to 16GB/s. Once the data is transferred, the GPU is able to sustain much higher aggregate bandwidths on the order of 100GB/s. Therefore, transferring data to and from the GPU is to be avoided as much as possible.
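The cost of the copy can be estimated with simple arithmetic. The following is a minimal C++ sketch, not part of the original listing; the function names transfer_bytes() and transfer_seconds() are hypothetical, and the model assumes that 1GB is 10^9 bytes and ignores any fixed per-transfer latency.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved in one direction when copying `count` floats.
std::size_t transfer_bytes(std::size_t count) {
    return count * sizeof(float);
}

// Estimated seconds for one copy at a given bandwidth (bytes per second).
// Assumption of this sketch: no fixed setup latency, only bytes / bandwidth.
double transfer_seconds(std::size_t bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return static_cast<double>(bytes) / bytes_per_second;
}
```

For the 100,000 floats in Listing 10.1, transfer_bytes(100000) gives 400,000 bytes on platforms where a float is 4 bytes. At 8GB/s that is roughly 50 microseconds in each direction, whereas streaming the same bytes at 100GB/s on the device would take roughly 4 microseconds. Under these assumptions, the two copies cost far more than a single pass over the data on the GPU.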

Each GPU supports a large number of software threads. These threads are arranged in groups called blocks. The main program assigns work to blocks of threads, and each thread in a block executes the same routine. The number of threads in a block can be set by the code. It should be a multiple of 32 and, on the hardware described here, can be as large as 512 threads (later GPUs raise this limit to 1,024).

The code takes the number of threads per block and uses this to calculate how many blocks are needed to complete the work. The function call syntax has been extended so that the call to square() takes both the normal parameters and the details of the number of blocks and the number of threads per block. This function call causes the GPU to execute the code using the specified number of blocks and threads per block.
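The block count in Listing 10.1 is a rounded-up integer division. The sketch below isolates that arithmetic in plain C++ so it can be checked without a GPU; the helper names blocks_needed() and idle_threads() are hypothetical and do not appear in the listing.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threads_per_block >= length.
// Same arithmetic as Listing 10.1: integer division rounds down, so
// subtracting 1 before dividing and adding 1 afterward rounds up instead.
int blocks_needed(int length, int threads_per_block) {
    assert(length > 0 && threads_per_block > 0);
    return (length - 1) / threads_per_block + 1;
}

// Threads in the final block that fall past the end of the array.
// These are the threads that the bounds check in the kernel masks off.
int idle_threads(int length, int threads_per_block) {
    return blocks_needed(length, threads_per_block) * threads_per_block - length;
}
```

With LEN set to 100000 and 128 threads per block, blocks_needed(100000, 128) returns 782, which launches 100,096 threads. The 96 surplus threads are why square() tests index < length before touching the array.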

Once the call to square() completes, the host copies the resulting data back from the device into host memory using a second cudaMemcpy() call. The last actions allow the host machine to free up the memory allocated on the host and on the GPU.

The function square() is declared with the __global__ keyword to indicate that it is a function that executes on the GPU but can be called by the host code. Each hardware thread on the GPU will execute the routine. The first thing the hardware thread needs to do is determine the index of the element that it needs to compute. Information about the topology of the block of threads is passed into the thread in the structures blockIdx, blockDim, and threadIdx. These three structures allow the CUDA framework the flexibility to specify that the thread performs computation in some three-dimensional space. However, in this example we are working on only a single dimension, x. The index that a particular thread should compute can be derived by multiplying the index of the block that is currently being computed by the size of each block and then adding the index of the current thread. The thread will then work on the element at that index.
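The index calculation can be modeled on the CPU to confirm that every element is visited exactly once. This is only an illustrative sketch under stated assumptions: square_one() and emulate_launch() are hypothetical names, the block and thread indices are passed as ordinary parameters rather than read from the CUDA built-in structures, and the loops run sequentially where a GPU would run the threads in parallel.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for the body of square(): one call models one GPU thread.
void square_one(std::vector<float>& data, int block_idx, int block_dim, int thread_idx) {
    int index = block_idx * block_dim + thread_idx;
    if (index < static_cast<int>(data.size())) {  // same bounds check as the kernel
        data[index] = data[index] * data[index];
    }
}

// Visit every (block, thread) pair in turn, as a sequential model of a launch.
void emulate_launch(std::vector<float>& data, int blocks, int threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            square_one(data, b, threads_per_block, t);
        }
    }
}
```

Calling emulate_launch(v, 3, 128) on a 300-element vector squares each element once: block 2, thread 43 maps to index 2 * 128 + 43 = 299, the last valid element, and the remaining threads of that block fail the bounds check and do nothing.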

In Listing 10.1, the computation performed is trivial, but much more complex work can be performed on the GPU. The theoretical peak performance can reach the teraflop range, that is, trillions of floating-point operations per second. This can make for a compelling solution for codes that require large amounts of computation.

There is another important point to make. The current trend is to see increasing numbers of general-purpose CPUs on the same chip. It is quite likely that in the future we will see processors that resemble today’s GPUs. Consequently, understanding how codes can be scaled to this level of threading is likely to be a useful skill in the future.
