1. The Hardware Model
An ideal processor is one where all artificial constraints on ILP are removed. The only
limits on ILP in such a processor are those imposed by the actual data flows
either through registers or memory.
The assumptions made for an ideal or perfect processor are as follows:
1. Register renaming: There are an infinite number of virtual registers available, and hence
all WAW and WAR hazards are avoided and an unbounded number of instructions can
begin execution simultaneously.
2. Branch prediction: Branch prediction is perfect. All conditional branches are predicted exactly.
3. Jump prediction: All jumps (including jump register used for return and computed
jumps) are perfectly predicted. When combined with perfect branch prediction,
this is equivalent to having a processor with perfect speculation and an
unbounded buffer of instructions available for execution.
4. Memory address alias analysis: All memory addresses are known exactly, and a load
can be moved before a store provided that the addresses are not identical.
Assumptions 2 and 3 eliminate all control
dependences. Likewise, assumptions 1 and 4 eliminate all but the true data dependences.
Together, these four assumptions mean that any instruction in the
program’s execution can be scheduled on the cycle immediately following the
execution of the predecessor on which it depends.
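The scheduling rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full simulator: it assumes unit latency for every instruction, tracks only register dependences (memory is ignored), and uses a hypothetical trace format of `(dest, sources)` pairs that does not come from the text.

```python
def ideal_schedule(trace):
    """Return the cycle in which each instruction executes on the ideal machine.

    `trace` is a list of (dest, sources) pairs in program order (a hypothetical
    format chosen for this sketch). `dest` is a register name or None, and
    `sources` is a tuple of register names. Every instruction takes one cycle.
    """
    ready = {}    # register -> cycle in which its most recent value is produced
    cycles = []
    for dest, sources in trace:
        # Only true (RAW) dependences delay an instruction. WAR and WAW
        # hazards vanish because each write conceptually gets a fresh register.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle
    return cycles


def ideal_ilp(trace):
    """Instructions per cycle achieved by the ideal schedule."""
    return len(trace) / max(ideal_schedule(trace))


example = [
    ("r1", ()),            # i0: r1 = ...
    ("r2", ("r1",)),       # i1: r2 = f(r1)      true dependence on i0
    ("r1", ()),            # i2: r1 = ...        WAW with i0, WAR with i1
    ("r3", ("r1", "r2")),  # i3: r3 = f(r1, r2)  true dependences on i2 and i1
]
print(ideal_schedule(example))
```

Instruction i2 reuses the name r1 but still runs in the first cycle, which is the effect of assumption 1: only the chain i0, i1, i3 determines the length of the schedule.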
2. Limitations on the Window Size and Maximum Issue Count
A dynamic processor might be able to more closely
match the amount of parallelism uncovered by our ideal processor.
Consider what the perfect processor must do:
1. Look arbitrarily far ahead to find a set of
instructions to issue, predicting all branches perfectly.
2. Rename all register uses to avoid WAR and WAW hazards.
3. Determine whether there are any data dependences
among the instructions in the issue packet; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing instructions and handle them
appropriately.
5. Provide enough replicated functional units to allow all the ready instructions to
issue.
Obviously, this analysis is quite complicated. For example, determining whether n issuing
instructions have any register dependences among them, assuming all
instructions are register-register and the total number of registers is
unbounded, requires

$$(2n-2) + (2n-4) + \cdots + 2 = 2\sum_{i=1}^{n-1} i = \frac{2(n-1)n}{2} = n^2 - n$$

comparisons.
Thus, to detect dependences among the next 2000 instructions—the default size
we assume in several figures—requires almost four million comparisons! Even
issuing only 50 instructions requires 2450 comparisons. This cost obviously
limits the number of instructions that can be considered for issue at once.
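The two figures quoted above can be checked with a short Python sketch. It assumes, as the formula does, that every instruction has two register source operands and that each one must be compared against the destination of every earlier instruction in the issue packet. The function name is illustrative only.

```python
def comparisons_needed(n):
    """Count register comparisons needed to check n issuing instructions.

    The instruction at position i (0-based) has i earlier instructions in the
    packet, and each of its two source operands is compared against each of
    their destinations, giving 2 * i comparisons.
    """
    total = 0
    for i in range(1, n):
        total += 2 * i
    return total


for n in (50, 2000):
    # The running sum should agree with the closed form n^2 - n.
    assert comparisons_needed(n) == n * n - n
    print(n, comparisons_needed(n))
```

Because the count grows quadratically, doubling the number of instructions considered for issue roughly quadruples the comparison hardware.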
3. The Effects of Realistic Branch and Jump Prediction
Our ideal processor assumes that branches can be perfectly predicted: The outcome of any
branch in the program is known before the first instruction is executed.
The five levels of branch prediction shown in these figures are:
1. Perfect: All branches and jumps are perfectly predicted at the start of execution.
2. Tournament-based branch predictor: The prediction scheme uses a correlating two-bit predictor and
a noncorrelating two-bit predictor together with a selector, which chooses the
best predictor for each branch.
3. Standard two-bit predictor with 512 two-bit entries: In addition, we assume a 16-entry
buffer to predict returns.
4. Static: A static predictor uses the profile history of the program and predicts that the
branch is always taken or always not taken based on the profile.
5. None: No branch prediction is used, though jumps are still predicted. Parallelism is
largely limited to within a basic block.
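As an illustration of the standard two-bit scheme in the list above, here is a minimal Python sketch of a table of 512 two-bit saturating counters. The indexing function (low-order bits of the word-aligned branch address), the initial counter value, and all names are assumptions made for this sketch; the text does not specify them, and the 16-entry return buffer is not modeled.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters indexed by branch address."""

    def __init__(self, entries=512):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low address bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, outcomes):
    """Fraction of (pc, taken) outcomes that the predictor gets right."""
    correct = 0
    for pc, taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch that is taken nine times and then falls through, run ten times.
loop_branch = ([(0x40, True)] * 9 + [(0x40, False)]) * 10
print(accuracy(TwoBitPredictor(), loop_branch))
```

The two-bit counter mispredicts the loop exit once per pass but does not flip its prediction for the next pass, which is the usual argument for preferring it over a one-bit scheme.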
4. Limitations on ILP for Realizable Processors
In this section we look at the performance of processors with an ambitious level of
hardware support equal to or better than what is likely in the next five years.
In particular, we assume the following fixed attributes:
1. Up to 64
instruction issues per clock with no issue restrictions. As we discuss later,
the practical implications of very wide issue widths on clock rate, logic
complexity, and power may be the most important limitation on exploiting ILP.
2. A tournament predictor with 1K entries and a 16-entry return predictor. This
predictor is fairly comparable to the best predictors in 2000; the predictor is
not a primary bottleneck.
3. Perfect disambiguation of memory references done dynamically. This is ambitious but
perhaps attainable for small window sizes (and hence small issue rates and
load/store buffers) or through a memory dependence predictor.
4. Register renaming with 64 additional integer and 64 additional FP registers, exceeding
the largest number available on any processor in 2001 (41 and 41 in the Alpha
21264), but probably easily reachable within two or three years.
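To see how a finite window and issue width constrain the schedule, the following Python sketch issues instructions from a trace under both limits. It is a simplified model under stated assumptions: unit latency, perfect branch prediction, register dependences only, unlimited renaming registers, and a window defined as the next `window` instructions starting at the oldest one not yet issued. The trace format and all names are hypothetical, and the default window of 2000 is only borrowed from the default size mentioned earlier.

```python
def producers_of(trace):
    """For each instruction, list the indices of the instructions it truly depends on."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    deps = []
    for i, (dest, sources) in enumerate(trace):
        deps.append([last_writer[s] for s in sources if s in last_writer])
        if dest is not None:
            last_writer[dest] = i
    return deps


def limited_schedule(trace, window=2000, issue_width=64):
    """Return the issue cycle of each instruction under window and issue limits."""
    deps = producers_of(trace)
    n = len(trace)
    issue_cycle = [None] * n
    head = 0     # oldest instruction that has not issued yet
    cycle = 0
    while head < n:
        cycle += 1
        issued = 0
        # Only instructions inside the window are candidates this cycle.
        for i in range(head, min(n, head + window)):
            if issued == issue_width:
                break
            if issue_cycle[i] is not None:
                continue
            # Ready only if every producer issued in an earlier cycle.
            if all(issue_cycle[p] is not None and issue_cycle[p] < cycle
                   for p in deps[i]):
                issue_cycle[i] = cycle
                issued += 1
        while head < n and issue_cycle[head] is not None:
            head += 1
    return issue_cycle


independent = [("r%d" % i, ()) for i in range(8)]
chain_then_free = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ())]
print(limited_schedule(independent, window=8, issue_width=2))
print(limited_schedule(chain_then_free, window=2))
print(limited_schedule(chain_then_free, window=4))
```

In the second trace, the last instruction is independent of the chain, yet with a window of two it cannot issue until the chain has drained far enough for it to enter the window. Widening the window to four lets it issue in the first cycle.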