PARALLEL PROCESSING CHALLENGES
Limitations of ILP
The Hardware Model
An ideal processor is one where
all constraints on ILP are removed. The only limits on ILP in such a processor
are those imposed by the actual data flows through either registers or memory.
The
assumptions made for an ideal or perfect processor are as follows:
1.Register renaming
—There are
an infinite number of virtual registers available, and hence all WAW and
WAR
hazards are avoided and an unbounded number of instructions can begin execution
simultaneously.
2.Branch prediction
—Branch
prediction is perfect. All conditional branches are predicted exactly.
3.Jump prediction
—All jumps
(including jump register used for return and computed jumps) are perfectly
predicted.
When combined with perfect branch prediction, this is equivalent to having a
processor with perfect speculation and an unbounded buffer of instructions
available for execution.
4.Memory address alias analysis
—All
memory addresses are known exactly, and a load can be moved before a store
provided
that the addresses are not identical. Note that this implements perfect address
alias analysis.
5.Perfect caches
—All
memory accesses take 1 clock cycle. In practice, superscalar processors will
typically
consume large amounts of ILP hiding cache misses, making these results highly
optimistic.
To measure the available
parallelism, a set of programs was compiled and optimized with the standard
MIPS optimizing compilers. The programs were instrumented and executed to
produce a trace of the instruction and data references. Every instruction in
the trace is then scheduled as early as possible, limited only by the data
dependences. Since a trace is used, perfect branch prediction and perfect alias
analysis are easy to do. With these mechanisms, instructions may bescheduled
much earlier than they would otherwise, moving across large numbers of
instructions on which they are not data dependent, including branches, since
branches are perfectly predicted.
The effects of various
assumptions are given before looking at some ambitious but realizable
processors.
Limitations on the Window Size and Maximum Issue
Count
To build a processor that even
comes close to perfect branch prediction and perfect alias analysis requires
extensive dynamic analysis, since static compile time schemes cannot be perfect.
Of course, most realistic dynamic schemes will not be perfect, but the use of
dynamic schemes will provide the ability to uncover parallelism that cannot be
analyzed by static compile time analysis. Thus, a dynamic processor might be
able to more closely match the amount of parallelism uncovered by our ideal
processor.
The Effects of Realistic Branch and Jump
Prediction
Our ideal processor assumes that
branches can be perfectly predicted: The outcome of any branch in the program
is known before the first instruction is executed! Of course, no real processor
can ever achieve this.
We assume a separate predictor is
used for jumps. Jump predictors are important primarily with the most accurate
branch predictors, since the branch frequency is higher and the accuracy of the
branch predictors dominates.
1.Perfect —All branches and jumps are
perfectly predicted at the start of execution.
2.Tournament-based branch predictor —The
prediction scheme uses a correlating 2-bit predictor and a noncorrelating 2-bit
predictor together with a selector, which chooses the best predictor for each
branch.
The Effects of Finite Registers
Our ideal processor eliminates
all name dependences among register references using an infinite set of virtual
registers. To date, the IBM Power5 has provided the largest numbers of virtual
registers: 88 additional floating-point and 88 additional integer registers, in
addition to the 64 registers available in the base architecture. All 240
registers are shared by two threads when executing in multithreading mode, and
all are available to a single thread when in single-thread mode.
The Effects of Imperfect Alias Analysis
Our optimal model assumes that it
can perfectly analyze all memory dependences, as well as eliminate all register
name dependences. Of course, perfect alias analysis is not possible in
practice: The analysis cannot be perfect at compile time, and it requires a
potentially unbounded number of comparisons at run time (since the number of
simultaneous memory references is unconstrained).
The three models are
1.
Global/stack perfect—This
model does perfect predictions for global and stack references and assumes all
heap references conflict. This model represents an idealized version of the
best compiler-based analysis schemes currently in production. Recent and
ongoing research on alias analysis for pointers should improve the handling of
pointers to the heap in the future.
2. Inspection—This
model examines the accesses to see if they can be determined not to interfere
at compile time. For example, if an access uses R10 as a base register with an
offset of 20, then another access that uses R10 as a base register with an
offset of 100 cannot interfere, assuming R10 could not have changed. In
addition, addresses based on registers that point to different allocation areas
(such as the global area and the stack area) are assumed never to alias. This
analysis is similar to that performed by many existing commercial compilers,
though newer compilers can do better, at least for looporiented programs.
3.
None—All
memory references are assumed to conflict. As you might expect, for the FORTRAN
programs (where no heap references exist), there is no difference between
perfect and global/stack perfect analysis
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2024 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.