How Latency and Bandwidth Impact Performance
Memory latency is the time between a processor requesting an item of data and that item arriving from memory. The more processors there are in a system, the longer the memory latency. A system with a single processor can have a memory latency of less than 100ns; with two processors this can double, and when the system gets large and comprises multiple boards, the memory latency can become very high. Memory latency is a problem because there is little that a processor can do while it waits for the data it needs to arrive from memory. Techniques such as out-of-order (OoO) execution enable the processor to make some forward progress while waiting for data from memory. However, these techniques are unlikely to hide the entire cost of a memory miss, although they may manage to cover the time it takes to get data from the second-level cache. They also add significant complexity and implementation area to the design of the processor core.
Cores that support multiple hardware threads are an alternative solution to the problem of memory latency. When one thread is stalled waiting for data to be returned from memory, the other threads can still make forward progress. So although having multiple hardware threads does not improve the performance of the stalled thread, it improves the utilization of the core and consequently its throughput (that is, some threads are completing work even while one thread is stalled).
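The effect of extra hardware threads on core utilization can be illustrated with a rough model. This is only a sketch under simplifying assumptions that are not from the text: every thread alternates between a fixed number of compute cycles and a fixed memory stall, the core switches between threads at no cost, and only one thread issues work at a time. The function name and the cycle counts are hypothetical.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which the core does useful work.

    Assumed model: each thread computes for compute_cycles, then stalls
    for stall_cycles waiting on memory; thread switching is free.
    """
    # Fraction of time a single thread keeps the core busy.
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # More threads fill the stalls, but the core cannot exceed 100% busy.
    return min(1.0, threads * per_thread)


# Hypothetical workload: 25 cycles of work per 100-cycle memory stall.
for n in (1, 2, 4, 8):
    print(n, core_utilization(n, compute_cycles=25, stall_cycles=100))
```

In this model no individual thread runs any faster, but utilization grows with the thread count until the core is always busy, which matches the throughput argument above.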
The other measurement that is relevant to discussions of memory is bandwidth. Bandwidth measures how much data can be returned from memory per second. For example, imagine that in one second a virtual CPU issues 10 million load instructions and each request misses cache. Each cache miss fetches a 64-byte cache line from memory, so that single virtual CPU has consumed 640MB/s of bandwidth.
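The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function and constant names are illustrative, and it assumes that every miss fetches exactly one full cache line.

```python
CACHE_LINE_BYTES = 64  # cache line size used in the example


def bandwidth_bytes_per_second(misses_per_second, line_bytes=CACHE_LINE_BYTES):
    """Bandwidth consumed when every cache miss fetches one full line."""
    return misses_per_second * line_bytes


# 10 million loads per second, each missing cache.
single_cpu = bandwidth_bytes_per_second(10_000_000)
print(f"{single_cpu / 1e6:.0f} MB/s")  # prints "640 MB/s"
```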
A CMT chip can make large demands of memory bandwidth since, at any one time, each thread could possibly have one or more outstanding memory requests. Suppose that there are 64 threads on a processor, the memory latency is 100 cycles, and the processor is clocked at a modest rate of 1GHz. If each thread is constantly issuing requests for new cache lines from memory, then each thread will issue one such request every 100 cycles (100 cycles being the time it takes for the previous request to complete). This makes (1 billion / 100) × 64 = 640 million memory requests per second. If each request is for a fresh 64-byte cache line, then this represents an aggregate bandwidth of approximately 41GB/s.
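The same estimate can be written as a short calculation. This is a sketch under the assumptions stated in the text (each thread has exactly one request outstanding at a time, and every request fetches a fresh 64-byte line); the function and parameter names are illustrative.

```python
def aggregate_bandwidth(threads, clock_hz, latency_cycles, line_bytes=64):
    """Return (requests per second, bytes per second) for a chip where
    every thread issues a new memory request as soon as its previous
    one completes."""
    # One request per thread every latency_cycles cycles.
    requests_per_thread = clock_hz / latency_cycles
    requests_per_second = requests_per_thread * threads
    return requests_per_second, requests_per_second * line_bytes


# 64 threads, 1GHz clock, 100-cycle memory latency.
requests, bandwidth = aggregate_bandwidth(64, 1_000_000_000, 100)
print(f"{requests / 1e6:.0f} million requests/s")  # prints "640 million requests/s"
print(f"{bandwidth / 1e9:.2f} GB/s")               # prints "40.96 GB/s"
```

Changing the thread count or the latency shows how quickly the demand scales: halving the latency or doubling the number of threads doubles the bandwidth required.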