Data Parallelism Using SIMD Instructions
Although this book discusses data parallelism in the context of multiple threads cooperating on processing the same item of data, the concept also extends into instruction sets. There are instructions, called single instruction multiple data (SIMD) instructions, that load a vector of data and perform an operation on all the items in the vector. Most processors have these instructions: the SSE instruction set extensions for x86 processors, the VIS instructions for SPARC processors, and the AltiVec instructions on Power/PowerPC processors.
The loop shown in Listing 3.1 is ideal for conversion into SIMD instructions.
Listing 3.1 Loop Adding Two Vectors
void vadd(double * restrict a, double * restrict b, int count)
{
    for (int i = 0; i < count; i++)
    {
        a[i] += b[i];
    }
}
Compiling this on an x86 box without enabling SIMD instructions generates the assembly language loop shown in Listing 3.2.
Listing 3.2 Assembly Language Code to Add Two Vectors Using x87 Instructions
loop:
  fldl  (%edx)      // Load the value of a[i]
  faddl (%ecx)      // Add the value of b[i]
  fstpl (%edx)      // Store the result back to a[i]
  addl  $8,%edx     // Increment the pointer to a
  addl  $8,%ecx     // Increment the pointer to b
  addl  $1,%esi     // Increment the loop counter
  cmp   %eax,%esi   // Test for the end of the loop
  jle   loop        // Branch back to start of loop if not complete
Compiling with SIMD instructions produces code similar to that shown in Listing 3.3.
Listing 3.3 Assembly Language Code to Add Two Vectors Using SSE Instructions
loop:
  movupd (%edx),%xmm0   // Load a[i] and a[i+1] into vector register
  movupd (%ecx),%xmm1   // Load b[i] and b[i+1] into vector register
  addpd  %xmm1,%xmm0    // Add vector registers
  movupd %xmm0,(%edx)   // Store a[i] and a[i+1] back to memory
  addl   $16,%edx       // Increment pointer to a
  addl   $16,%ecx       // Increment pointer to b
  addl   $2,%esi        // Increment loop counter
  cmp    %eax,%esi      // Test for the end of the loop
  jle    loop           // Branch back to start of loop if not complete
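Which of the two code sequences the compiler produces depends on how the code is built. As a rough, compiler-specific illustration (an assumption; the exact options vary between compilers and versions), a 32-bit gcc build emits the x87 code of Listing 3.2 by default, while compiling with something like gcc -std=c99 -O3 -msse2 -S vadd.c allows the compiler to auto-vectorize the loop and emit code similar to Listing 3.3.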
Since two double-precision values are computed at the same time, the trip count around the loop is halved, so the number of instructions is halved. The move to SIMD instructions also enables the compiler to avoid the inefficiencies of the stack-based x87 floating-point architecture.
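The same transformation can also be written by hand using compiler intrinsics. The sketch below mirrors the structure of Listing 3.3 using the SSE2 intrinsics declared in <emmintrin.h>; the function name vadd_sse2 and the scalar clean-up for an odd element count are illustrative additions, not part of the original listings.

#include <emmintrin.h>   // SSE2 intrinsics

// Hand-vectorized equivalent of vadd() from Listing 3.1 (illustrative sketch)
void vadd_sse2(double * restrict a, double * restrict b, int count)
{
    int i;
    for (i = 0; i + 1 < count; i += 2)
    {
        __m128d va = _mm_loadu_pd(&a[i]);          // Load a[i] and a[i+1]
        __m128d vb = _mm_loadu_pd(&b[i]);          // Load b[i] and b[i+1]
        _mm_storeu_pd(&a[i], _mm_add_pd(va, vb));  // Add and store back to a
    }
    if (i < count)   // Handle the final element when count is odd
    {
        a[i] += b[i];
    }
}

In practice, letting the compiler vectorize the simple loop of Listing 3.1 is usually preferable; the intrinsic form is shown only to make the two-elements-per-iteration structure explicit.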
SIMD and parallelization are complementary technologies. SIMD is often useful in situations where loops perform operations over vectors of data, and those same loops can often also be parallelized. Using both approaches at once enables a multicore chip to achieve high throughput. SIMD instructions have an additional advantage: they can also be useful in situations where the amount of work is too small to be parallelized effectively.
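As an illustration of combining the two approaches, the loop from Listing 3.1 could be divided among threads while each thread's share of the iterations remains a simple loop that the compiler can convert into SIMD instructions. The sketch below expresses the threading with an OpenMP directive; this is an assumption used for illustration rather than code from the chapter, and it would typically be compiled with something like gcc -O3 -msse2 -fopenmp.

// Illustrative sketch: thread-level parallelism plus SIMD on the same loop
void vadd_parallel(double * restrict a, double * restrict b, int count)
{
    #pragma omp parallel for    // Divide the iterations among threads
    for (int i = 0; i < count; i++)
    {
        a[i] += b[i];           // Each thread's chunk can still be vectorized
    }
}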