SMT and CMP Architectures
Superscalar processors (SS)
Ø Four or more instructions per cycle
Ø Looks within a single program or thread to find multiple instructions to issue each cycle.
Ø Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
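The out-of-order point can be illustrated with a small sketch. It does not model any real pipeline: the instruction tuples, the `issue_width` parameter, the single-cycle latency, and the assumption that registers are already renamed are all made up for illustration. Each cycle, up to `issue_width` instructions whose source registers are available are issued, regardless of their position in program order.

```python
def schedule(instrs, issue_width=4):
    """Return a list of cycles; each cycle lists the instruction names issued.

    instrs: list of (name, dest_reg, [src_regs]) in program order.
    Illustrative assumptions: every instruction has 1-cycle latency, each
    destination register is written once (already renamed), and a register
    that no instruction in `instrs` writes is available from cycle 0.
    """
    produced = {dest for _, dest, _ in instrs}
    ready = set()            # registers whose values are available
    pending = list(instrs)   # not yet issued, still in program order
    cycles = []
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == issue_width:
                break
            _, _, srcs = ins
            # Issue only if every source is available (dependency check).
            if all(s in ready or s not in produced for s in srcs):
                issued.append(ins)
        if not issued:
            raise ValueError("unsatisfiable dependency in instruction list")
        for ins in issued:
            pending.remove(ins)
        # Results become visible at the start of the next cycle.
        ready.update(dest for _, dest, _ in issued)
        cycles.append([name for name, _, _ in issued])
    return cycles


program = [
    ("i1", "r1", ["a"]),         # r1 = load a
    ("i2", "r2", ["r1"]),        # r2 = r1 + 1
    ("i3", "r3", ["b"]),         # r3 = load b
    ("i4", "r4", ["r3"]),        # r4 = r3 * 2
    ("i5", "r5", ["r2", "r4"]),  # r5 = r2 + r4
]
print(schedule(program))
```

In this example `i3` issues in the first cycle alongside `i1`, ahead of `i2`, because `i2` is still waiting for `r1`.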
Ø Multithreaded processors keep hardware state for several threads
Ø On any given cycle a processor executes instructions from one of the threads
Ø Multiprocessors: performance is improved by adding more CPUs
Ø Simultaneous multithreading (SMT): the idea is to issue multiple instructions from multiple threads each cycle
Ø Exploits both thread-level parallelism and instruction-level parallelism.
Ø Modern processors have more functional units available than a single thread can use
Ø Builds on register renaming and dynamic scheduling
Ø Multiple instructions from independent threads can co-exist and co-execute.
Superscalar processor with no multithreading:
Ø Only one thread is processed in one clock cycle
Ø Use of issue slots is limited by a lack of ILP.
Ø A major stall such as an instruction cache miss leaves the entire processor idle.
Fine grained Multithreading
Ø Switches between threads on every clock cycle
Ø Pro: hides the latency of both short and long stalls
Ø Con: slows down execution of individual threads that are ready to go. Only one thread issues instructions in a given clock cycle.
Coarse grained Multithreading
Ø Switches threads only on costly stalls (e.g., L2 cache misses)
Ø Pros: no switching each clock cycle, no slowdown for ready-to-go threads. Reduces the number of completely idle clock cycles.
Ø Con: limitations in hiding shorter stalls
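The trade-off between the two switching policies can be made concrete with a toy single-issue simulation. This is a sketch under stated assumptions, not a model of real hardware: each thread is a list of stall latencies (cycles the thread is unavailable after issuing that instruction), the function names and the `costly` threshold are hypothetical, and thread switches are assumed to be free.

```python
def fine_grained(threads):
    """Round-robin among ready threads every cycle. Returns a per-cycle
    trace of the issuing thread id (None marks an idle cycle)."""
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    trace, cycle, last = [], 0, -1
    while any(pc[t] < len(threads[t]) for t in range(n)):
        chosen = None
        for k in range(1, n + 1):          # start with the thread after `last`
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            ready_at[chosen] = cycle + 1 + threads[chosen][pc[chosen]]
            pc[chosen] += 1
            last = chosen
        trace.append(chosen)
        cycle += 1
    return trace


def coarse_grained(threads, costly=10):
    """Stay on one thread; switch only after a stall of at least `costly`
    cycles (or when the thread finishes). Shorter stalls leave idle cycles."""
    n = len(threads)
    pc = [0] * n
    ready_at = [0] * n
    trace, cycle, cur = [], 0, 0

    def has_work(t):
        return pc[t] < len(threads[t])

    def next_thread(after, now):
        # Prefer a thread that is ready at `now`; otherwise any with work left.
        order = [(after + k) % n for k in range(1, n + 1)]
        candidates = [t for t in order if has_work(t)]
        ready = [t for t in candidates if ready_at[t] <= now]
        return (ready or candidates)[0]

    while any(has_work(t) for t in range(n)):
        if not has_work(cur):
            cur = next_thread(cur, cycle)
        if ready_at[cur] <= cycle:
            latency = threads[cur][pc[cur]]
            ready_at[cur] = cycle + 1 + latency
            pc[cur] += 1
            trace.append(cur)
            if latency >= costly and any(has_work(t) for t in range(n)):
                cur = next_thread(cur, cycle + 1)
        else:
            trace.append(None)             # short stall: the pipeline idles
        cycle += 1
    return trace


short_stall = [[0, 2, 0], [0, 0, 0]]   # thread 0 has one 2-cycle stall
print(fine_grained(short_stall))
print(coarse_grained(short_stall))
```

With the `short_stall` input, the fine grained trace has no idle cycles but interleaves the two threads, while the coarse grained trace runs thread 0 to completion and idles during its short stall.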
Ø SMT exploits TLP at the same time it exploits ILP, with multiple threads using the issue slots in a single clock cycle.
Ø Use of issue slots is limited by the following factors:
Ø Imbalances in the resource needs.
Ø Resource availability over multiple threads.
Ø Number of active threads considered.
Ø Finite limitations of buffers.
Ø Ability to fetch enough instructions from multiple threads.
Ø Practical limitations of what instruction combinations can issue from one thread and from multiple threads.
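How the issue slots fill under each organization can be sketched with a deliberately simple counting model. Everything here is an assumption for illustration: `ready[c][t]` is a made-up count of independent instructions thread `t` could issue in cycle `c`, unissued instructions do not carry over to later cycles, and the limiting factors listed above (buffers, fetch bandwidth, issue restrictions) are ignored.

```python
def slots_used(ready, issue_width, mode):
    """Count issue slots filled over all cycles.

    mode "superscalar": only thread 0 runs.
    mode "fine":        one thread per cycle, chosen round-robin.
    mode "smt":         all threads may share the slots of one cycle.
    """
    n_threads = len(ready[0])
    used = 0
    for cycle, per_thread in enumerate(ready):
        if mode == "superscalar":
            avail = per_thread[0]
        elif mode == "fine":
            avail = per_thread[cycle % n_threads]
        elif mode == "smt":
            avail = sum(per_thread)
        else:
            raise ValueError("unknown mode: " + mode)
        used += min(avail, issue_width)   # cannot exceed the issue width
    return used


ready = [[2, 1], [0, 3], [1, 1], [4, 2]]   # 4 cycles, 2 threads (hypothetical)
total = 4 * len(ready)                     # 4-wide issue, 16 slots in total
for mode in ("superscalar", "fine", "smt"):
    print(mode, slots_used(ready, 4, mode), "of", total)
```

In this toy input the single thread leaves most slots empty, fine grained switching removes the fully idle cycle, and only the SMT mode fills slots within a cycle from both threads.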
Performance Implications of SMT
Ø Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread
Ø While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
Ø With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
Ø Alpha 21464 and Intel Pentium 4 are examples of SMT
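The ICOUNT idea can be sketched as a fetch policy that favors the threads currently holding the fewest instructions in the front end of the pipeline. The sketch below is illustrative only: the function names, the `drain` rates (instructions each thread issues per cycle), and the rule that one thread receives the whole fetch width are assumptions, not details taken from any particular processor.

```python
def icount_select(front_end_counts, max_threads=2):
    """Return the thread ids to fetch from this cycle: those with the fewest
    instructions waiting in the pre-issue stages (ties broken by thread id)."""
    order = sorted(front_end_counts, key=lambda t: (front_end_counts[t], t))
    return order[:max_threads]


def simulate(drain, cycles, fetch_width=4):
    """Toy loop: each cycle the selected thread fetches `fetch_width`
    instructions, then every thread issues up to drain[t] of its own."""
    counts = {t: 0 for t in drain}    # instructions waiting per thread
    fetched = {t: 0 for t in drain}   # total instructions fetched per thread
    for _ in range(cycles):
        t = icount_select(counts, 1)[0]
        counts[t] += fetch_width
        fetched[t] += fetch_width
        for u in counts:
            counts[u] = max(0, counts[u] - drain[u])
    return fetched


print(icount_select({0: 12, 1: 3, 2: 7, 3: 3}))
print(simulate({0: 3, 1: 1}, cycles=8))
```

A thread that drains its instructions slowly accumulates a high count and is fetched less often, which keeps it from clogging the shared queues.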
Effectively Using Parallelism on an SMT Processor
Figure: instruction throughput when executing a parallel workload
Comparison of SMT vs Superscalar
Ø SMT processors are compared to base superscalar processors in several key measures:
Ø Utilization of functional units.
Ø Utilization of fetch units.
Ø Accuracy of branch predictor.
Ø Hit rates of primary caches.
Ø Hit rates of secondary caches.
1. CMP Architecture
Ø Chip-level multiprocessing (CMP or multicore): integrates two or more independent cores (normally a CPU) into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together, each executing threads independently.
Ø Every functional unit of a processor is duplicated.
Ø Multiple processors, each with a full set of architectural resources, reside on the same die.
Ø Processors may share an on-chip cache or each can have its own cache
Ø Examples: HP Mako, IBM Power4
Ø Challenges: Power, Die area (cost)
Single core computer
Chip Multithreading = Chip Multiprocessing + Hardware Multithreading
Ø Chip Multithreading (CMT) is the capability of a processor to process multiple s/w threads and to support simultaneous h/w threads of execution.
Ø CMT is achieved by multiple cores on a single chip or multiple threads on a single core.
Ø CMP processors are especially suited to server workloads, which generally have high levels of Thread-Level Parallelism (TLP).
Ø CMPs are now the only way to build high performance microprocessors, for a variety of reasons:
Ø Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream.
Ø One cannot simply ratchet up the clock speed on today's processors, or the power dissipation will become prohibitive.
Ø CMT processors support many h/w strands through efficient sharing of on-chip resources such as pipelines, caches and predictors.
Ø CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.
SMT and CMP
Ø The performance race between SMT and CMP is not yet decided.
Ø CMP is easier to implement, but only SMT has the ability to hide latencies.
Ø A functional partitioning is not exactly reached within an SMT processor due to the centralized instruction issue.
Ø A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
Ø A combination of simultaneous multithreading with the CMP may be superior.
Ø Combine SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread.
Ø Close to