Processor Architectures
When we think of instruction-level parallelism, we usually imagine
a processor issuing several operations in a single clock cycle. In fact, it is
possible for a machine to issue just one operation per clock and yet
achieve instruction-level parallelism using the concept of pipelining.
In the following, we shall first explain pipelining and then discuss
multiple-instruction issue.
1. Instruction Pipelines and Branch Delays
Practically every processor, be it a high-performance
supercomputer or a standard machine, uses an instruction pipeline. With
an instruction pipeline, a new instruction can be fetched every clock while
preceding instructions are still going through the pipeline. Shown in Fig. 10.1
is a simple 5-stage instruction pipeline: it first fetches the instruction
(IF), decodes it (ID), executes the operation (EX), accesses the memory (MEM),
and writes back the result (WB). The figure shows how instructions i, i +
1, i + 2, i + 3, and i + 4 can execute at the same time.
Each row corresponds to a clock tick, and each column in the figure specifies
the stage each instruction occupies at each clock tick.
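Figure 10.1 itself is not reproduced here, but the overlap it describes is easy to sketch. The following Python fragment (an illustration written for this discussion, not taken from the original figure) prints which stage each of five instructions occupies at each clock tick:

    # Stages of the simple 5-stage pipeline described above.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_diagram(num_instructions=5):
        """Print the stage each instruction occupies at each clock tick."""
        total_clocks = num_instructions + len(STAGES) - 1
        for clock in range(total_clocks):
            row = []
            for instr in range(num_instructions):
                stage = clock - instr  # instruction `instr` enters IF at tick `instr`
                row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
            print(f"clock {clock + 1}:  " + "  ".join(f"{s:<3}" for s in row))

    pipeline_diagram()

Reading across a middle row of the printed table shows five instructions occupying five different stages at once, which is exactly the overlap that lets the processor fetch a new instruction every clock.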
If the result from an instruction is available by
the time the succeeding instruction needs the data, the processor can issue an
instruction every clock. Branch instructions are especially problematic because
until they are fetched, decoded, and executed, the processor does not know which
instruction will execute next. Many processors speculatively fetch and decode
the immediately succeeding instructions in case a branch is not taken. When a
branch is found to be taken, the instruction pipeline is emptied and the branch
target is fetched.
Thus, taken branches delay the fetch of the branch
target and introduce "hiccups" in the instruction pipeline. Advanced
processors use hardware to predict the outcomes of branches based on their
execution history and to prefetch from the predicted target locations. Branch
delays are nonetheless observed if branches are mispredicted.
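The text does not commit to any particular prediction scheme, but one common history-based design is a 2-bit saturating counter kept per branch. The Python sketch below is an illustrative model of that idea; the branch address 0x40 and the loop outcome pattern are invented for the example:

    class TwoBitPredictor:
        """Counter values 0-1 predict not taken; 2-3 predict taken."""
        def __init__(self):
            self.counters = {}  # branch address -> counter in range 0..3

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2  # True means "taken"

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    p = TwoBitPredictor()
    mispredicts = 0
    for taken in [True] * 9 + [False]:  # a loop branch: taken 9 times, then exits
        if p.predict(0x40) != taken:
            mispredicts += 1
        p.update(0x40, taken)
    print(mispredicts)  # 2: one miss while warming up, one on loop exit

After warming up, the predictor is wrong only on the final, not-taken iteration, which is where the misprediction delay described above would be paid.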
2. Pipelined Execution
Some instructions take several clocks to execute.
One common example is the memory-load operation. Even when a memory access hits
in the cache, it usually takes several clocks for the cache to return the
data. We say that the execution of an instruction is pipelined if
succeeding instructions not dependent on the result are allowed to proceed.
Thus, even if a processor can issue only one operation per clock, several
operations might be in their execution stages at the same time. If the deepest
execution pipeline has n stages, potentially n operations can be "in flight" at the same time. Note that not all instructions are fully
pipelined. While floating-point adds and multiplies often are fully pipelined,
floating-point divides, being more complex and less frequently executed, often
are not.
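To see what pipelined execution buys, consider the toy Python model below; the latencies and the one-issue-per-clock assumption are invented for illustration. A fully pipelined unit can accept a new operation every clock even though each takes several clocks to finish, while a non-pipelined unit (like many divide units) stays busy for its whole latency:

    def clocks_to_issue(n_ops, latency, pipelined):
        """Clocks elapsed until n_ops back-to-back operations have issued."""
        clock, busy_until = 0, 0
        for _ in range(n_ops):
            clock = max(clock, busy_until)  # wait while the unit is busy
            busy_until = clock + (1 if pipelined else latency)
            clock += 1                      # at most one issue per clock
        return clock

    print(clocks_to_issue(10, latency=4, pipelined=True))    # 10
    print(clocks_to_issue(10, latency=20, pipelined=False))  # 181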
Most general-purpose processors dynamically detect
dependences between consecutive instructions and automatically stall the
execution of instructions if their operands are not available. Some processors,
especially those embedded in hand-held devices, leave the dependence checking
to the software in order to keep the hardware simple and power consumption low.
In this case, the compiler is responsible for inserting "no-op"
instructions in the code if necessary to ensure that the results are available
when needed.
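As a sketch of the compiler's job on such a machine, the Python fragment below inserts no-ops ahead of an instruction whose operand is not yet available. The (opcode, destination, sources) format and the 2-clock load latency are invented for the example:

    LOAD_LATENCY = 2  # assumed: a load's result is usable 2 slots later

    def insert_noops(instrs):
        """instrs: list of (opcode, dest, sources) triples."""
        out = []
        ready = {}  # register -> first slot at which its value may be read
        for op, dest, srcs in instrs:
            need = max((ready.get(r, 0) for r in srcs), default=0)
            while len(out) < need:          # pad until operands are ready
                out.append(("nop", None, []))
            out.append((op, dest, srcs))
            latency = LOAD_LATENCY if op == "load" else 1
            ready[dest] = len(out) - 1 + latency
        return out

    prog = [("load", "r1", ["r2"]), ("add", "r3", ["r1", "r4"])]
    for instr in insert_noops(prog):
        print(instr)  # one nop lands between the load and the dependent add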
3. Multiple Instruction Issue
By issuing several operations per clock, processors can keep even
more operations in flight. The largest number of operations that can be
executed simultaneously can be computed by multiplying the instruction issue
width by the average number of stages in the execution pipeline.
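For instance, a machine that issues 4 operations per clock into execution pipelines averaging 5 stages deep could have up to 4 × 5 = 20 operations in flight at once.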
Like pipelining, parallelism on multiple-issue machines can be
managed either by software or hardware. Machines that rely on software to
manage their parallelism are known as VLIW (Very-Long-Instruction-Word)
machines, while those that manage their parallelism with hardware are known as superscalar
machines. VLIW machines, as their name implies, have wider than normal
instruction words that encode the operations to be issued in a single clock.
The compiler decides which operations are to be issued in parallel and encodes
the information in the machine code explicitly. Superscalar machines, on the
other hand, have a regular instruction set with an ordinary
sequential-execution semantics. Superscalar machines automatically detect
dependences among instructions and issue them as their operands become
available. Some processors include both VLIW and superscalar functionality.
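To make the contrast concrete, here is a minimal Python sketch of VLIW-style bundling, in which the compiler greedily packs mutually independent operations into fixed-width long words. The two-operation width and the (opcode, destination, sources) format are illustrative assumptions, not any real instruction set:

    ISSUE_WIDTH = 2  # assumed: two operations per long instruction word

    def bundle(instrs):
        """Greedily pack independent triples; assumes each result is
        available to the following long word."""
        bundles, current, written = [], [], set()
        for op, dest, srcs in instrs:
            independent = not (set(srcs) & written or dest in written)
            if len(current) == ISSUE_WIDTH or not independent:
                bundles.append(current)     # emit the current long word
                current, written = [], set()
            current.append((op, dest, srcs))
            written.add(dest)
        if current:
            bundles.append(current)
        return bundles

    prog = [("add", "r1", ["r2", "r3"]),
            ("mul", "r4", ["r5", "r6"]),   # independent of the add
            ("sub", "r7", ["r1", "r4"])]   # depends on both results
    for word in bundle(prog):
        print(word)  # [add, mul] share one word; sub starts the next

A superscalar machine running the same three instructions could discover the same pairing at run time; the difference lies in who performs the analysis and where its result is recorded.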
Simple hardware schedulers execute instructions in the order in
which they are fetched. If a scheduler comes across a dependent instruction, it
and all instructions that follow must wait until the dependences are resolved
(i.e., the needed results are available). Such machines obviously can benefit
from having a static scheduler that places independent operations next to each
other in the order of execution.
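A toy model makes the benefit measurable. The Python sketch below (latencies and instruction format are invented) counts the stall cycles an in-order machine pays for two orderings of the same four instructions; hoisting the independent load hides most of the load latency:

    LAT = {"load": 3}  # assumed latencies; everything else takes 1 clock

    def stalls_in_order(instrs):
        """Count stall cycles for strict in-order, one-issue-per-clock."""
        clock, ready, stalls = 0, {}, 0
        for op, dest, srcs in instrs:
            start = max([clock] + [ready.get(r, 0) for r in srcs])
            stalls += start - clock         # everything behind waits too
            ready[dest] = start + LAT.get(op, 1)
            clock = start + 1
        return stalls

    naive = [("load", "r1", ["r9"]), ("add", "r2", ["r1", "r1"]),
             ("load", "r3", ["r9"]), ("add", "r4", ["r3", "r3"])]
    scheduled = [("load", "r1", ["r9"]), ("load", "r3", ["r9"]),
                 ("add", "r2", ["r1", "r1"]), ("add", "r4", ["r3", "r3"])]
    print(stalls_in_order(naive), stalls_in_order(scheduled))  # 4 1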
More sophisticated schedulers can execute instructions "out
of order." Operations are independently stalled and not allowed to
execute until all the values they depend on have been produced. Even these
schedulers benefit from static scheduling, because hardware schedulers have
only a limited space in which to buffer operations that must be stalled. Static
scheduling can place independent operations close together to allow better
hardware utilization. More importantly, regardless of how sophisticated a dynamic
scheduler is, it cannot execute instructions it has not fetched. When the
processor has to take an unexpected branch, it can only find parallelism among
the newly fetched instructions. The compiler can enhance the performance of the
dynamic scheduler by ensuring that these newly fetched instructions can execute
in parallel.
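The buffer limitation is also easy to model. In the Python sketch below (window size, latencies, and instruction format are all invented), the scheduler issues the oldest ready operation among the next WINDOW buffered instructions each clock. Setting WINDOW to 1 degenerates to in-order issue, and no window, however large, can include instructions that have not yet been fetched:

    WINDOW = 2         # assumed buffer of stalled operations
    LAT = {"load": 3}  # assumed latencies; everything else takes 1 clock

    def clocks_out_of_order(instrs):
        """Clocks elapsed until every operation has issued, out of order."""
        ready, pending, clock = {}, list(instrs), 0
        while pending:
            for i, (op, dest, srcs) in enumerate(pending[:WINDOW]):
                if all(ready.get(r, 0) <= clock for r in srcs):
                    ready[dest] = clock + LAT.get(op, 1)
                    pending.pop(i)
                    break                   # at most one issue per clock
            clock += 1
        return clock

    prog = [("load", "r1", ["r9"]), ("add", "r2", ["r1", "r1"]),
            ("load", "r3", ["r9"]), ("add", "r4", ["r3", "r3"])]
    print(clocks_out_of_order(prog))  # 5; with WINDOW = 1 it takes 8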