Instruction Issue Rate with Pipelined Processor Cores
previously discussed, the core of a
processor is the part of the processor responsible for executing instructions.
Early processors would execute a single instruction every cycle, so a processor
that ran at 4MHz could execute 4 million instructions every second. The logic
to execute a single instruction could be quite complex, so the time it takes to
execute the longest instruction determined how long a cycle had to take and
therefore defined the maximum clock speed for the processor.
improve this situation, processor designs became “pipelined.” The operations
nec-essary to complete a single instruction were broken down into multiple
smaller steps. This was the simplest pipeline:
n Fetch. Fetch the next instruction from
n Decode. Determine what type of
instruction it is.
n Execute. Do the appropriate work.
n Retire. Make the state changes from the
instruction visible to the rest of the system.
that the overall time it takes for an instruction to complete remains the same,
each of the four steps takes one-quarter of the original time. However, once an
instruction has completed the Fetch step, the next instruction can enter that
stage. This means that four instructions can be in execution at the same time.
The clock rate, which determines when an instruction completes a pipeline
stage, can now be four times faster than it was. It now takes four clock cycles
for an instruction to complete execution. This means that each instruction
takes the same wall time to complete its execution. But there are now four
instructions progressing through the processor pipeline, so the pipelined
processor can execute instructions at four times the rate of the nonpipelined
example, Figure 1.9 shows the integer and floating-point pipelines from the
UltraSPARC T2 processor. The integer pipeline has eight stages, and the
floating-point pipeline has twelve stages.
given to the various stages are not of great importance, but several aspects of
the pipeline are worthy of discussion. Four pipeline stages are performed
regardless of whether the instruction is floating point or integer. Only at the
Execute stage of the pipeline does the path diverge for the two instruction
instructions, the result of the operation can be made available to any
subse-quent instructions at the Bypass stage. The subsequent instruction needs
the data at the Execute stage, so if the first instruction starts executing at
cycle zero, a dependent instruction can start in cycle 3 and expect the data to
be available by the time it is needed. This is shown in Figure 1.10 for integer
instructions. An instruction that is fetched in cycle 0 will produce a result
that can be bypassed to a following instruction seven cycles later when it
reaches the Bypass stage. The dependent instruction would need this result as
input when it reaches the Execute stage. If an instruction is fetched every
cycle, then the fourth instruction will have reached the Execute stage by the
time the first instruction has reached the Bypass stage.
downside of long pipelines is correcting execution in the event of an error;
the most common example of this is mispredicted branches.
fetching instructions, the processor needs to guess the next instruction that
will be executed. Most of the time this will be the instruction at the
following address in memory. However, a branch instruction might change the
address where the instruction is to be fetched from—but the processor will know
this only once all the conditions that the branch depends on have been resolved
and once the actual branch instruction has been executed.
approach to dealing with this is to predict whether branches are taken and then
to start fetching instructions from the predicted address. If the processor
predicts correctly, then there is no interruption to the instruction steam—and
no cost to the branch. If the processor predicts incorrectly, all the
instructions executed after the branch need to be flushed, and the correct
instruction stream needs to be fetched from memory. These are called branch mispredictions, and their cost is
proportional to the length of the pipeline. The longer the pipeline, the longer
it takes to get the correct instructions through the pipeline in the event of a
enabled higher clock speeds for processors, but they were still executing only
a single instruction every cycle. The next improvement was “super-scalar
execution,” which means the ability to execute multiple instructions per cycle.
The Intel Pentium was the first x86 processor that could execute multiple
instructions on the same cycle; it had two pipelines, each of which could
execute an instruction every cycle. Having two pipelines potentially doubled
performance over the previous generation.
recent processors have four or more pipelines. Each pipeline is specialized to handle
a particular type of instruction. It is typical to have a memory pipeline that
han-dles loads and stores, an integer pipeline that handles integer
computations (integer addi-tion, shifts, comparison, and so on), a
floating-point pipeline (to handle floating-point computation), and a branch
pipeline (for branch or call instructions). Schematically, this would look
something like Figure 1.11.
UltraSPARC T2 discussed earlier has four pipelines for each core: two for
inte-ger operations, one for memory operations, and one for floating-point
operations. These four pipelines are shared between two groups of four threads,
and every cycle one thread from both of the groups can issue an instruction.