Data Parallelism Using SIMD Instructions
Although this book discusses data parallelism in the context of multiple threads cooperating on processing the same item of data, the concept also extends into instruction sets. There are instructions, called single instruction multiple data (SIMD) instructions, that load a vector of data and perform an operation on all the items in the vector. Most processors have these instructions: the SSE instruction set extensions for x86 processors, the VIS instructions for SPARC processors, and the AltiVec instructions on Power/PowerPC processors.
The loop shown in Listing 3.1 is ideal for conversion into SIMD instructions.
Listing 3.1 Loop Adding Two Vectors
void vadd(double * restrict a, double * restrict b, int count)
{
    for (int i = 0; i < count; i++)
    {
        a[i] += b[i];
    }
}
Compiling this on an x86 box without enabling SIMD instructions generates the assembly language loop shown in Listing 3.2.
Listing 3.2 Assembly Language Code to Add Two Vectors Using x87 Instructions
loop:
  fldl  (%edx)      // Load the value of a[i]
  faddl (%ecx)      // Add the value of b[i]
  fstpl (%edx)      // Store the result back to a[i]
  addl  $8,%edx     // Increment the pointer to a
  addl  $8,%ecx     // Increment the pointer to b
  addl  $1,%esi     // Increment the loop counter
  cmp   %eax,%esi   // Test for the end of the loop
  jle   loop        // Branch back to start of loop if not complete
Compiling with SIMD instructions produces code similar to that shown in Listing 3.3.
Listing 3.3 Assembly Language Code to Add Two Vectors Using SSE Instructions
loop:
  movupd (%edx),%xmm0   // Load a[i] and a[i+1] into vector register
  movupd (%ecx),%xmm1   // Load b[i] and b[i+1] into vector register
  addpd  %xmm1,%xmm0    // Add vector registers
  movupd %xmm0,(%edx)   // Store a[i] and a[i+1] back to memory
  addl   $16,%edx       // Increment pointer to a
  addl   $16,%ecx       // Increment pointer to b
  addl   $2,%esi        // Increment loop counter
  cmp    %eax,%esi      // Test for the end of the loop
  jle    loop           // Branch back to start of loop if not complete
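Which of the two code sequences the compiler produces depends on how the code is built. As a rough, compiler-specific illustration (an assumption; the exact options vary between compilers and versions), a 32-bit gcc build emits the x87 code of Listing 3.2 by default, while compiling with something like gcc -std=c99 -O3 -msse2 -S vadd.c allows the compiler to auto-vectorize the loop and emit code similar to Listing 3.3.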
Since two double-precision values are computed at the same time, the trip count around the loop is halved, so the number of instructions is halved. The move to SIMD instructions also enables the compiler to avoid the inefficiencies of the stack-based x87 floating-point architecture.
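The same transformation can also be written by hand using compiler intrinsics. The sketch below mirrors the structure of Listing 3.3 using the SSE2 intrinsics declared in <emmintrin.h>; the function name vadd_sse2 and the scalar clean-up for an odd element count are illustrative additions, not part of the original listings.

#include <emmintrin.h>   // SSE2 intrinsics

// Hand-vectorized equivalent of vadd() from Listing 3.1 (illustrative sketch)
void vadd_sse2(double * restrict a, double * restrict b, int count)
{
    int i;
    for (i = 0; i + 1 < count; i += 2)
    {
        __m128d va = _mm_loadu_pd(&a[i]);          // Load a[i] and a[i+1]
        __m128d vb = _mm_loadu_pd(&b[i]);          // Load b[i] and b[i+1]
        _mm_storeu_pd(&a[i], _mm_add_pd(va, vb));  // Add and store back to a
    }
    if (i < count)   // Handle the final element when count is odd
    {
        a[i] += b[i];
    }
}

In practice, letting the compiler vectorize the simple loop of Listing 3.1 is usually preferable; the intrinsic form is shown only to make the two-elements-per-iteration structure explicit.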
SIMD and parallelization are complementary technologies. SIMD is often useful in situations where loops perform operations over vectors of data, and those same loops can often also be parallelized. Using both approaches at once enables a multicore chip to achieve high throughput. SIMD instructions have an additional advantage: they can also be useful in situations where the amount of work is too small to be parallelized effectively.
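As an illustration of combining the two approaches, the loop from Listing 3.1 could be divided among threads while each thread's share of the iterations remains a simple loop that the compiler can convert into SIMD instructions. The sketch below expresses the threading with an OpenMP directive; this is an assumption used for illustration rather than code from the chapter, and it would typically be compiled with something like gcc -O3 -msse2 -fopenmp.

// Illustrative sketch: thread-level parallelism plus SIMD on the same loop
void vadd_parallel(double * restrict a, double * restrict b, int count)
{
    #pragma omp parallel for    // Divide the iterations among threads
    for (int i = 0; i < count; i++)
    {
        a[i] += b[i];           // Each thread's chunk can still be vectorized
    }
}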