Chapter: Advanced Computer Architecture : Multi-Core Architectures

SMT and CMP Architectures

Chip-level multiprocessing (CMP or multicore): integrates two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together, each core executing threads independently.

 

Instruction-level parallelism(ILP)

 

Ø Wide-issue superscalar processors (SS)

 

Ø Four or more instructions per cycle

Ø Executing a single program or thread

Ø Attempts to find multiple instructions to issue each cycle.

 

Ø Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
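
The dependence-driven issue described above can be sketched with a toy scheduler. This is a minimal illustration, not a model of any real processor: the function name `schedule_out_of_order`, the `deps` representation, and the one-cycle latency for every instruction are all assumptions made for the example.

```python
def schedule_out_of_order(deps, issue_width):
    """Return the cycle in which each instruction issues.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Assumption: every instruction completes in one cycle.
    """
    issue_cycle = {}
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        # Ready = all producers issued in an earlier cycle.
        ready = [i for i in pending
                 if all(d in issue_cycle and issue_cycle[d] < cycle
                        for d in deps[i])]
        # Issue the oldest ready instructions, up to the issue width.
        for i in ready[:issue_width]:
            issue_cycle[i] = cycle
            pending.remove(i)
        cycle += 1
    return [issue_cycle[i] for i in range(len(deps))]


# Instruction 1 needs 0, instruction 3 needs 2, instruction 4 needs 1 and 3.
deps = [[], [0], [], [2], [1, 3]]
print(schedule_out_of_order(deps, issue_width=2))  # [0, 1, 0, 1, 2]
```

Instruction 2 issues in cycle 0, ahead of instruction 1, because it has no unmet dependence; that reordering is what "out of order" refers to.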

 

Thread-level parallelism(TLP)

 

Ø Fine-grained multithreaded superscalars(FGMS)

 

Ø Contain hardware state for several threads

Ø Executing multiple threads

Ø On any given cycle a processor executes instructions from one of the threads

 

Ø Multiprocessor(MP)

 

Ø Performance improved by adding more CPUs

 

Simultaneous Multithreading

 

The idea is to issue multiple instructions from multiple threads in each cycle.

 

The features are:

 

Ø Fully exploit thread-level parallelism and instruction-level parallelism.

 

Ø Multiple functional units

 

Ø Modern processors have more functional units available than a single thread can utilize.

 

Ø Register renaming and dynamic scheduling

 

Ø Multiple instructions from independent threads can co-exist and co-execute.

 

Superscalar processor with no multithreading:

 

Only one thread is processed in one clock cycle

 

Ø Use of issue slots is limited by a lack of ILP.

 

Ø A stall such as an instruction cache miss leaves the entire processor idle.

 

Fine grained Multithreading

 

Switches threads on every clock cycle

 

Ø Pro: hides the latency of both short and long stalls

 

Ø Con: slows down execution of individual threads that are ready to go, because only one thread issues instructions in a given clock cycle.
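
A minimal sketch of per-cycle thread switching follows. It assumes a round-robin order that skips stalled threads, which is a common way to describe fine-grained multithreading but is not stated in these notes; the name `fine_grained_schedule` and the `ready` table are hypothetical.

```python
def fine_grained_schedule(ready):
    """ready[t][c] is True when thread t could issue in cycle c.

    Returns the thread chosen in each cycle, or None if every thread
    is stalled. Assumption: round-robin order, skipping stalled threads.
    """
    num_threads = len(ready)
    chosen = []
    next_thread = 0
    for cycle in range(len(ready[0])):
        pick = None
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready[t][cycle]:
                pick = t
                break
        chosen.append(pick)
        if pick is not None:
            next_thread = (pick + 1) % num_threads
    return chosen


ready = [
    [True, True, False, False, True],   # thread 0 stalls in cycles 2-3
    [True, True, True, True, True],     # thread 1 never stalls
]
print(fine_grained_schedule(ready))  # [0, 1, 1, 1, 0]
```

No cycle is left empty, so thread 0's stall is hidden. The cost is also visible: thread 0 was ready in cycle 1 but had to wait, because only one thread issues per cycle.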

 

Coarse-grained multithreading:

 

Switches threads only on costly stalls (e.g., L2 cache misses)

 

Ø Pros: no switching on each clock cycle and no slowdown for ready-to-go threads. Reduces the number of completely idle clock cycles.

 

Ø Con: limitations in hiding shorter stalls
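
The switch-on-costly-stall policy can be sketched as a small single-issue simulation. Everything here is an illustrative assumption: the function name `coarse_grained_run`, the `stalls` trace format, the `threshold` that separates costly stalls from short ones, and the absence of any thread-switch penalty.

```python
def coarse_grained_run(stalls, threshold):
    """Simulate single-issue coarse-grained multithreading.

    stalls[t][i] = stall cycles before instruction i of thread t can issue.
    The processor switches threads only when the remaining stall is at
    least `threshold` cycles and another thread is ready.
    Assumption: switching is free. Returns (total_cycles, idle_cycles).
    """
    n = len(stalls)
    pos = [0] * n                                   # next instruction per thread
    ready_at = [s[0] if s else 0 for s in stalls]   # cycle when it can issue
    current, cycle, idle = 0, 0, 0

    def unfinished(t):
        return pos[t] < len(stalls[t])

    while any(unfinished(t) for t in range(n)):
        others_ready = [t for t in range(n)
                        if t != current and unfinished(t) and ready_at[t] <= cycle]
        if not unfinished(current):
            # Current thread is done: prefer a ready thread, else any with work.
            current = (others_ready[0] if others_ready
                       else next(t for t in range(n) if unfinished(t)))
        elif ready_at[current] - cycle >= threshold and others_ready:
            current = others_ready[0]               # costly stall: switch
        if ready_at[current] <= cycle:
            pos[current] += 1                       # issue one instruction
            if unfinished(current):
                ready_at[current] = cycle + 1 + stalls[current][pos[current]]
        else:
            idle += 1                               # short stall is not hidden
        cycle += 1
    return cycle, idle


print(coarse_grained_run([[0, 0, 5, 0]], threshold=3))             # (9, 5)
print(coarse_grained_run([[0, 0, 5, 0], [0, 1, 0]], threshold=3))  # (9, 2)
```

With a second thread available, the 5-cycle stall is mostly covered by useful work, while the 1-cycle stall in the second thread still costs an idle cycle, which matches the pro and con listed above.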

 

Simultaneous Multithreading:

 

Exploits TLP at the same time it exploits ILP with multiple threads using the issue slots in a single-clock cycle.

 

Ø Use of issue slots is limited by the following factors:

 

Ø Imbalances in the resource needs.

Ø Resource availability over multiple threads.

Ø Number of active threads considered.

Ø Finite buffer capacities.

Ø Ability to fetch enough instructions from multiple threads.

 

Ø Practical limitations on which instruction combinations can issue from one thread and from multiple threads.
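
To make the slot-sharing idea concrete, the sketch below fills the issue slots of each cycle from every thread that has instructions ready. It is a deliberately simplified model: the names `smt_issue` and `slots_filled`, the `ready` table, and the fixed thread priority (thread 0 first) are assumptions, and instructions that miss a slot are simply dropped rather than carried to the next cycle.

```python
def smt_issue(ready, width):
    """ready[t][c] = instructions thread t could issue in cycle c.

    Returns, per cycle, a list of (thread, count) pairs showing how the
    `width` issue slots were shared. Assumption: lower-numbered threads
    get first pick of the slots.
    """
    schedule = []
    for cycle in range(len(ready[0])):
        free = width
        issued = []
        for thread in range(len(ready)):
            take = min(ready[thread][cycle], free)
            if take > 0:
                issued.append((thread, take))
                free -= take
        schedule.append(issued)
    return schedule


def slots_filled(schedule):
    """Total number of issue slots used across all cycles."""
    return sum(count for cycle in schedule for _, count in cycle)


ready = [[2, 0, 1, 3], [1, 2, 0, 2]]
alone = smt_issue(ready[:1], width=4)   # thread 0 on a plain superscalar
both = smt_issue(ready, width=4)        # both threads sharing the slots
print(slots_filled(alone), slots_filled(both))  # 6 10 (out of 16 slots)
```

In the last cycle the two threads together want five slots but only four exist, a small instance of the resource limits listed above.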

 

Performance Implications of SMT

 

Ø Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by prioritizing one thread

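Ø While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT) is to fetch from the threads that currently have the fewest instructions in the front end of the pipeline, so that each thread gets a roughly equal share of processor resources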

 

Ø With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x

 

Ø Alpha 21464 and Intel Pentium 4 (Hyper-Threading) are examples of SMT designs
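
The ICOUNT heuristic named above is usually described as giving fetch priority to the threads with the fewest instructions waiting in the decode, rename, and issue-queue stages. A minimal sketch of that selection step, with a hypothetical function name and made-up occupancy counts, could look like this:

```python
def icount_pick(front_end_counts, num_to_fetch=1):
    """Choose which threads to fetch from this cycle.

    front_end_counts[t] = instructions of thread t currently in the
    front end (decode, rename, issue queues). Threads with the lowest
    count win; ties go to the lower thread number (an assumption).
    """
    order = sorted(range(len(front_end_counts)),
                   key=lambda t: (front_end_counts[t], t))
    return order[:num_to_fetch]


# Thread 0 is clogging the queues, so threads 1 and 3 are fetched instead.
print(icount_pick([12, 3, 7, 3], num_to_fetch=2))  # [1, 3]
```

Favoring low-count threads keeps a stalled thread from filling the shared queues with instructions that cannot issue.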

 

Effectively Using Parallelism on an SMT Processor

 

Instruction Throughput executing a parallel workload

 

 

Comparison of SMT vs Superscalar

 

SMT processors are compared to base superscalar processors in several key measures:

Ø Utilization of functional units.

Ø Utilization of fetch units.

Ø Accuracy of branch predictor.

Ø Hit rates of primary caches.

Ø Hit rates of secondary caches.

 

Performance improvement:

 

Ø Issue slots.

 

Ø Functional units.

 

Ø Renaming registers.

 

1. CMP Architecture

 

Ø Chip-level multiprocessing (CMP or multicore): integrates two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together, each core executing threads independently.

 

Ø Every functional unit of a processor is duplicated.

 

Ø Multiple processors, each with a full set of architectural resources, reside on the same die

 

Ø Processors may share an on-chip cache or each can have its own cache

 

 

Ø Examples: HP Mako, IBM Power4

Ø Challenges: Power, Die area (cost)

 

Single core computer





 

Chip Multithreading

 

Chip Multithreading = Chip Multiprocessing + Hardware Multithreading.

 

Ø Chip multithreading (CMT) is the capability of a processor to process multiple software threads on simultaneous hardware threads of execution.

 

Ø CMT is achieved by multiple cores on a single chip, by multiple threads on a single core, or by both.

 

Ø CMP processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP).
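
Since chip multithreading combines several cores with several hardware threads per core, the number of software threads that can run at once is the product of the two. The sketch below maps software threads onto (core, strand) pairs; the function name `place_threads` and the spread-across-cores-first order are illustrative assumptions, not something these notes prescribe.

```python
def place_threads(num_software_threads, cores, strands_per_core):
    """Map software threads onto (core, strand) hardware threads.

    Assumption: one strand on every core is filled before a second
    strand is used anywhere. Threads beyond the hardware capacity
    map to None, meaning they must wait to be scheduled.
    """
    hardware_threads = [(core, strand)
                        for strand in range(strands_per_core)
                        for core in range(cores)]
    placement = {}
    for sw in range(num_software_threads):
        if sw < len(hardware_threads):
            placement[sw] = hardware_threads[sw]
        else:
            placement[sw] = None
    return placement


# 2 cores x 2 strands = 4 hardware threads; the fifth software thread waits.
print(place_threads(5, cores=2, strands_per_core=2))
# {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1), 4: None}
```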

 

CMP Performance

 

Ø CMPs are now the only way to build high-performance microprocessors, for a variety of reasons:

 

Ø Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream.

 

Ø The clock speed of today’s processors cannot simply be ratcheted up, or the power dissipation will become prohibitive.

 

Ø CMT processors support many hardware strands through efficient sharing of on-chip resources such as pipelines, caches and predictors.

 

Ø CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.
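
The clock-speed point above can be made concrete with the standard first-order model of dynamic power, P = C * V^2 * f. The sketch below assumes, purely for illustration, that supply voltage has to rise in proportion to frequency and that the workload splits perfectly across two cores; neither assumption comes from these notes, and all values are in arbitrary units.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS dynamic power: P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


base = dynamic_power(1.0, 1.0, 1.0)

# Option 1: one core at twice the clock, with voltage scaled up to match
# (assumed linear scaling).
one_fast_core = dynamic_power(1.0, 2.0, 2.0)

# Option 2: two cores at the original clock and voltage.
two_base_cores = 2 * base

print(one_fast_core, two_base_cores)  # 8.0 2.0
```

Under these assumptions, doubling the core count reaches the same nominal throughput at a quarter of the power, which is the usual argument for adding cores instead of raising the clock.
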

SMT and CMP

 

Ø The performance race between SMT and CMP is not yet decided.

Ø CMP is easier to implement, but only SMT has the ability to hide latencies.

 

Ø A functional partitioning is not exactly reached within an SMT processor due to the centralized instruction issue.

 

Ø A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.

Ø A combination of simultaneous multithreading with the CMP may be superior.

 

Ø Research: combine an SMT or CMP organization with the ability to create threads, with compiler support or fully dynamically, out of a single thread.

Ø Thread-level speculation

Ø Close to multiscalar

