SMT and CMP Architectures
Ø Wide-issue Superscalar processors (SS)
Ø Four or more instructions per cycle
Ø Executing a single program or thread
Ø Attempts to find multiple instructions to issue each cycle.
Ø Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
Ø Fine-grained multithreaded superscalars (FGMS)
Ø Contain hardware state for several threads
Ø Executing multiple threads
Ø On any given cycle a processor executes instructions from one of the threads
Ø Performance improved by adding more CPUs
The idea of simultaneous multithreading (SMT) is to issue multiple instructions from multiple threads each cycle
Its features are:
Ø Fully exploit thread-level parallelism and instruction-level parallelism.
Ø Multiple functional units
Ø Modern processors have more functional units available than a single thread can utilize.
Ø Register renaming and dynamic scheduling
Ø Multiple instructions from independent threads can co-exist and co-execute.
Superscalar processor with no multithreading:
Only one thread is processed in one clock cycle
Ø Use of issue slots is limited by a lack of ILP.
Ø A stall such as an instruction cache miss leaves the entire processor idle.
Fine-grained multithreading
Switches threads on every clock cycle
Ø Pro: hides the latency of both short and long stalls
Ø Con: slows down execution of individual threads that are ready to go, because only one thread issues instructions in a given clock cycle.
Coarse-grained multithreading
Switches threads only on costly stalls (e.g., L2 cache misses)
Ø Pros: no thread switch on every clock cycle and no slowdown for ready-to-go threads; reduces the number of completely idle clock cycles.
Ø Con: limited ability to hide shorter stalls
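The trade-off between switching threads on every cycle and switching only on costly stalls (commonly called coarse-grained multithreading) can be explored with a toy simulation. The sketch below assumes a single-issue pipeline and describes each thread as a list of stall lengths, one per instruction; the function name `simulate`, the `costly` threshold and the workloads are illustrative assumptions, not taken from any real processor.

```python
def simulate(threads, policy, costly=10):
    """Toy single-issue pipeline (an illustrative model, not a real design).

    threads[t][i] is the stall, in cycles, that thread t suffers after its
    i-th instruction. policy is "fine" (rotate every cycle) or "coarse"
    (switch only when the current thread is in a costly stall or finished).
    Returns (total_cycles, idle_cycles).
    """
    if policy not in ("fine", "coarse"):
        raise ValueError("policy must be 'fine' or 'coarse'")
    n = len(threads)
    pc = [0] * n          # next instruction index of each thread
    ready_at = [0] * n    # first cycle at which each thread can issue again
    current = 0           # thread that issued most recently
    cycle = idle = 0

    def unfinished(t):
        return pc[t] < len(threads[t])

    def ready(t):
        return unfinished(t) and ready_at[t] <= cycle

    while any(unfinished(t) for t in range(n)):
        rotation = [(current + 1 + k) % n for k in range(n)]
        if policy == "fine":
            # Pick the next ready thread in round-robin order, skipping stalled ones.
            chosen = next((t for t in rotation if ready(t)), None)
        else:
            # Leave the current thread only if it is finished or its
            # remaining stall is at least `costly` cycles.
            blocked = (not unfinished(current)
                       or ready_at[current] - cycle >= costly)
            if blocked:
                current = next((t for t in rotation if ready(t)), current)
            chosen = current if ready(current) else None
        if chosen is None:
            idle += 1                      # nothing can issue this cycle
        else:
            stall = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            ready_at[chosen] = cycle + 1 + stall
            current = chosen
        cycle += 1
    return cycle, idle


# Two threads whose instructions are each followed by a short 2-cycle stall.
short_stalls = [[2, 2, 2, 2], [2, 2, 2, 2]]
fine_result = simulate(short_stalls, "fine")      # (11, 3)
coarse_result = simulate(short_stalls, "coarse")  # (20, 12)
```

With only short stalls, the coarse-grained policy never switches until a thread finishes, so every stall shows up as idle cycles, while the fine-grained policy overlaps most of them with the other thread.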
Simultaneous multithreading (SMT)
Exploits TLP at the same time it exploits ILP, with multiple threads using the issue slots in a single clock cycle.
Ø Use of the issue slots is limited by the following factors:
Ø Imbalances in the resource needs.
Ø Resource availability over multiple threads.
Ø Number of active threads considered.
Ø Finite buffer sizes.
Ø Ability to fetch enough instructions from multiple threads.
Ø Practical limitations on which instruction combinations can issue from one thread and from multiple threads.
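To make the issue-slot argument concrete, here is a minimal sketch that compares how many slots get filled when only one thread exists, when one thread is chosen per cycle, and when all threads may fill slots in the same cycle. It is an idealized model: it assumes the only limit is the issue width and deliberately ignores the limiting factors listed above. The name `slot_utilization` and the `ready` table are hypothetical.

```python
def slot_utilization(ready, width, mode):
    """Fraction of issue slots used under an idealized model.

    ready[c][t] is the number of independent instructions thread t could
    issue in cycle c (an assumed, made-up workload description).
    mode: "superscalar" -> only thread 0 runs
          "fine"        -> one thread per cycle, strict round-robin
          "smt"         -> any thread may fill any slot in the same cycle
    """
    used = 0
    for cycle, per_thread in enumerate(ready):
        if mode == "superscalar":
            used += min(width, per_thread[0])
        elif mode == "fine":
            thread = cycle % len(per_thread)
            used += min(width, per_thread[thread])
        elif mode == "smt":
            used += min(width, sum(per_thread))
        else:
            raise ValueError("unknown mode: " + mode)
    return used / (width * len(ready))


# Four cycles, four threads, 4-wide issue (illustrative numbers only).
ready = [
    [2, 1, 3, 0],
    [0, 2, 1, 1],
    [1, 0, 0, 2],
    [3, 1, 2, 1],
]
for mode in ("superscalar", "fine", "smt"):
    print(mode, slot_utilization(ready, width=4, mode=mode))
```

In this made-up workload the single-thread and one-thread-per-cycle modes leave most slots empty because no single thread has enough ILP, whereas the SMT mode combines instructions from several threads to fill nearly all of them.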
Performance Implications of SMT
Ø Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
Ø While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
Ø With eight threads in a processor with many resources, SMT yields throughput improvements of roughly a factor of 2-4
Ø Alpha 21464 and Intel Pentium 4 are examples of SMT
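The ICOUNT heuristic mentioned above is usually described as giving fetch priority to the threads that currently have the fewest instructions waiting in the front end of the pipeline, which tends to even out each thread's share of resources. The sketch below shows only that selection step; the function name `icount_select`, the `num_fetch` parameter and the counts are assumptions for illustration.

```python
def icount_select(in_flight, num_fetch=2):
    """Return the ids of the threads to fetch from this cycle.

    in_flight[t] is the number of instructions thread t currently has in
    the pre-issue stages (hypothetical counters). Threads with the fewest
    such instructions get priority; ties go to the lower thread id.
    """
    order = sorted(range(len(in_flight)), key=lambda t: (in_flight[t], t))
    return order[:num_fetch]


# Thread 0 is clogging the queues, so threads 1 and 3 are fetched instead.
picked = icount_select([12, 3, 7, 3])   # [1, 3]
```

A thread that stalls accumulates instructions in the queues and is therefore deprioritized automatically, without the fetch unit needing to know why it stalled.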
Effectively Using Parallelism on an SMT Processor
Instruction Throughput executing a parallel workload
Comparison of SMT vs Superscalar
SMT processors are compared to base superscalar processors in several key measures:
Ø Utilization of functional units.
Ø Utilization of fetch units.
Ø Accuracy of branch predictor.
Ø Hit rates of primary caches.
Ø Hit rates of secondary caches.
Ø Issue slots.
Ø Functional units.
Ø Renaming registers.
1. CMP Architecture
Ø Chip-level multiprocessing (CMP, or multicore) integrates two or more independent cores (normally CPUs) into a single package, composed of a single integrated circuit (IC), called a die, or of several dies packaged together, with each core executing threads independently.
Ø Every functional unit of a processor is duplicated.
Ø Multiple processors, each with a full set of architectural resources, reside on the same die
Ø Processors may share an on-chip cache or each can have its own cache
Ø Examples: HP Mako, IBM Power4
Ø Challenges: Power, Die area (cost)
Single core computer
Chip Multithreading = Chip Multiprocessing + Hardware Multithreading.
Ø Chip multithreading (CMT) is the capability of a processor to process multiple software threads simultaneously using multiple hardware threads of execution.
Ø CMT is achieved by multiple cores on a single chip, by multiple threads on a single core, or by both.
Ø CMP processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP).
Ø CMPs are now the only way to build high-performance microprocessors, for a variety of reasons:
Ø Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream.
Ø Clock speed cannot simply be ratcheted up on today’s processors, or power dissipation will become prohibitive.
Ø CMT processors support many hardware strands through efficient sharing of on-chip resources such as pipelines, caches and predictors.
Ø CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.
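The relation "Chip Multithreading = Chip Multiprocessing + Hardware Multithreading" can be restated as simple arithmetic: the number of hardware threads a chip exposes is the number of cores times the hardware threads per core. The small sketch below is only an illustration; the `Chip` class and the core and thread counts are hypothetical and do not describe a specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical description of a chip, for illustration only."""
    cores: int
    threads_per_core: int

    @property
    def hardware_threads(self) -> int:
        # Each core contributes its own set of hardware threads (strands).
        return self.cores * self.threads_per_core

    def organization(self) -> str:
        if self.cores > 1 and self.threads_per_core > 1:
            return "CMP + hardware multithreading"
        if self.cores > 1:
            return "CMP"
        if self.threads_per_core > 1:
            return "hardware multithreading"
        return "single-threaded single core"


for chip in (Chip(1, 1), Chip(1, 2), Chip(4, 1), Chip(8, 4)):
    print(chip, chip.hardware_threads, chip.organization())
```

All three multithreaded configurations count as CMT in the sense used above; they differ in how many resources are duplicated per thread versus shared.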
SMT and CMP
Ø The performance race between SMT and CMP is not yet decided.
Ø CMP is easier to implement, but only SMT has the ability to hide latencies.
Ø Functional partitioning is not fully achieved within an SMT processor because of the centralized instruction issue.
Ø A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
Ø A combination of simultaneous multithreading with the CMP may be superior.
Ø Research: combine an SMT or CMP organization with the ability to create threads out of a single thread, either with compiler support or fully dynamically.
Ø Thread-level speculation
Ø Close to multiscalar