GPU-Based Computing
An approach to parallelism that has recently evolved is the use of graphics co-processors as accelerators for computation. This came about because, as the requirements for fast and detailed graphical representations grew, the hardware needed to implement them came to increasingly resemble hardware capable of fast parallel floating-point or integer computation. Recently,
there have been multiple efforts to export the ability to perform computation
on graphics processing units (GPUs) to common programming languages. The most
well-known of these are Compute Unified Device Architecture (CUDA) and Open
Computing Language (OpenCL). CUDA is specific to Nvidia, whereas OpenCL is
supported on GPUs from both Nvidia and ATI.
Although this approach utilizes many cores to
perform computations in parallel, the details of the approach are quite
different from all the other approaches discussed in this book. The most
important consideration is that GPUs represent compute co-processors, and there
are several constraints with co-processors.
The first is that they need not,
and in the case of GPUs do not, share the instruction set of the host
processor. This means that producing code that runs on the GPU requires a more complex tool chain, one that must produce code for the host processor together with code for the GPU and then bind the two sets of code into a single executable.
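With Nvidia's tool chain, for example, the nvcc compiler driver handles this split automatically: it accepts a source file containing both host and GPU code, compiles each part with the appropriate back end, and links the results into a single executable. A minimal invocation (the file name is only illustrative) would be:

nvcc -o square square.cu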
The second problem is that GPUs
do not share the same address space as the host processor. Data needs to be
copied across to the GPU. The act of copying is time-consuming, so the problem being tackled must be large enough to justify the cost of the copy operation.
These problems can be, if not hidden, at least reduced by the language used to program the GPU. Listing 10.1 shows a simple program written using CUDA. The program written in OpenCL would look broadly similar; a sketch of the equivalent OpenCL kernel appears after the walk-through of square().
Listing 10.1 Simple CUDA Program
#include "cuda.h"
#define LEN 100000
// GPU code
__global__ void square( float *data, int length )
{
int index = blockIdx.x * blockDim.x + threadIdx.x; if ( index <
length )
{
data[index] = data[index] * data[index];
}
}
//Host code int main()
{
float *host_data, *gpu_data; int ThreadsPerBlock,
Blocks;
// Allocate memory
host_data = (float*)malloc( LEN * sizeof(float) );
cudaMalloc( &gpu_data, LEN*sizeof(float) );
// Initialize data on host
for( int i=0; i<LEN; i++ )
{
host_data[i] = 2*i;
}
// Copy host data to GPU
cudaMemcpy( gpu_data, host_data,
LEN*sizeof(int), cudaMemcpyHostToDevice );
// Perform
computation on GPU ThreadsPerBlock = 128;
Blocks = (int)( (LEN-1) / ThreadsPerBlock ) + 1;
square <<<Blocks, ThreadsPerBlock>>>( gpu_data, LEN );
// Copy
GPU data back to host
cudaMemcpy( gpu_data, host_data,
LEN*sizeof(int), cudaMemcpyDeviceToHost);
// Free allocated memory cudaFree( gpu_data );
free( host_data );
}
The code is in two routines. The routine square() contains the code that is actually executed by the GPU. The routine main() is executed by the host processor.
It is best to describe the routine main() first because this code performs the preparation for the parallel
work. This routine needs to allocate memory both on the host system, using malloc(), and on the GPU, using the call cudaMalloc(). The method of passing data between the host system and the GPU is to
copy the data from the host system to the GPU. Hence, it is necessary to
reserve space for the data in both places.
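Neither malloc() nor cudaMalloc() is guaranteed to succeed, so production code would normally check their return values. The following fragment is only a sketch of how the device allocation in Listing 10.1 could be checked; it assumes the variables from the listing and that <stdio.h> has been included:

// Sketch only: verify that the device allocation succeeded before using it.
cudaError_t err = cudaMalloc( (void**)&gpu_data, LEN * sizeof(float) );
if ( err != cudaSuccess )
{
  printf( "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
  return 1;
}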
The host_data array is
initialized and then copied over to the GPU by calling cudaMemcpy(). The cudaMemcpy() call is
used for transferring data in both directions, and the direction of the copy is determined by the last parameter passed
into the function. Copying data to and from the GPU requires the data to be
sent across the bus that connects the processor to the GPU. This bus provides
relatively low bandwidth, perhaps 8 GB/s to 16 GB/s. Once the data is transferred, the GPU is able to sustain much higher aggregate bandwidths, on the order of 100 GB/s. Therefore, transferring data to and from the GPU is to be
avoided as much as possible.
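As a rough illustration, the 100,000 floats in Listing 10.1 occupy about 400 KB, so moving them across an 8 GB/s link takes on the order of 50 microseconds in each direction; these figures are only indicative, but for a kernel as small as square() the two copies can easily dominate the total running time.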
Each GPU supports a large number of software
threads. These threads are arranged in groups called blocks. The main program assigns work to a block of threads, and
each thread in the block executes the same routine. The number of threads in a
block can be set by the code. It should be a multiple of 32 and can be as large
as 512 threads.
The code takes the number of threads per block and uses this to
calculate how many blocks are needed to complete the work. The function call
syntax has been extended so that the call to square() takes both the normal parameters and the details of the number of blocks and the number of
threads per block. This function call causes the GPU to execute the code using
the specified number of blocks and threads per block.
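For the values in Listing 10.1, this works out as (100000 - 1) / 128 + 1 = 782 blocks, so 782 * 128 = 100,096 threads are started. The last 96 threads compute an index that is not less than LEN, so the range check in square() stops them from touching memory beyond the end of the array.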
Once the call to square() completes, the host copies the resulting data back from the device into
host memory using a second cudaMemcpy() call.
The last actions allow the host machine to free up the memory allocated on the
host and on the GPU.
The function square() is declared with the __global__ keyword
to indicate that it is a function that executes on the GPU but can be called by
the host code. Each hardware thread on the GPU will execute the routine. The
first thing the hardware thread needs to do is determine the index of the
element that it needs to compute. Information about the topology of the block
of threads is passed into the thread in the structures blockIdx, blockDim, and
threadIdx. These three structures allow
the CUDA framework the flexibility to specify that
the thread performs computation in some three-dimensional space. However, in
this example we are working on only a single, x, dimension. The index that a particular thread should compute can be
derived by multiplying the index of the block that is currently being computed
by the size of each block and then adding the index of the current thread. The
thread will then work on the element at that index.
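For example, with 128 threads per block, the thread with threadIdx.x equal to 5 in the block with blockIdx.x equal to 3 computes index 3 * 128 + 5 = 389 and therefore squares data[389]. As noted earlier, the equivalent kernel written in OpenCL looks broadly similar. The following is only an illustrative sketch, not part of Listing 10.1; in OpenCL the combined block and thread index comes from get_global_id(), and the host-side setup code (context, command queue, and buffer management) is omitted:

// OpenCL sketch of the same kernel; get_global_id(0) plays the role of
// blockIdx.x * blockDim.x + threadIdx.x in the CUDA version.
__kernel void square( __global float *data, int length )
{
  int index = get_global_id(0);
  if ( index < length )
  {
    data[index] = data[index] * data[index];
  }
}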
In Listing 10.1, the computation performed is trivial, but much more complex work can be performed on the GPU. The theoretical performance of a GPU can reach trillions of floating-point operations per second (teraflops). This can make for a compelling solution for codes that require large amounts of computation.
There is another important point
to make. The current trend is to see increasing numbers of general-purpose CPUs
on the same chip. It is quite likely that in the future we will see processors
that resemble today’s GPUs. Consequently, understanding how codes can be scaled
to this level of threading is likely to be a useful skill in the future.