Latency and Bandwidth Impact Performance
Memory latency is the time between a processor requesting an item of data and that
item of data arriving from memory. The more processors there are in a system,
the longer the memory latency. A system with a single processor can have memory
latency of less than 100ns; with two processors this can double, and when the
system gets large and comprises multiple boards, the memory latency can become
very high. Memory latency is a problem because there is little that a processor
can do while it waits for the data it needs to be returned from memory.
There are techniques, such as out-of-order (OoO) execution, which enable the
processor to make some forward progress while waiting for data from memory.
However, it is unlikely that these techniques will hide the entire cost of a
memory miss, although they may manage to cover the time it takes to get data
from the second-level cache. These techniques also add significant complexity
and implementation area to the design of the processor core.
Processors that support multiple hardware threads are an alternative solution to
the problem of memory latency. When one thread is stalled waiting for data to be
returned from memory, the other threads can still make forward progress. So
although having multiple hardware threads does not improve the performance of
the stalled thread, it improves the utilization of the core and consequently
improves the throughput of the core (that is, there are threads completing work
even if one thread is stalled).
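The throughput benefit can be illustrated with a small model. This is a sketch, not from the text: assume each hardware thread alternates between a fixed number of compute cycles and a fixed memory stall, and that the core can run any thread that is not stalled.

```python
# Toy model of hardware multithreading (illustrative assumption, not a
# precise simulation): each thread computes for compute_cycles, then
# stalls for stall_cycles waiting on memory. With n_threads interleaved,
# the core is busy whenever at least one thread has work, so utilization
# is roughly min(1, n_threads * compute_cycles / (compute_cycles + stall_cycles)).

def core_utilization(n_threads, compute_cycles, stall_cycles):
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, n_threads * busy_fraction)

# A single thread that computes for 25 cycles and then stalls for 100
# keeps the core busy only 20% of the time; five such threads suffice
# to keep it fully utilized.
print(core_utilization(1, 25, 100))  # 0.2
print(core_utilization(5, 25, 100))  # 1.0
```

The stalled thread itself runs no faster, but the idle cycles it would otherwise waste are filled by the other threads, which is exactly the utilization argument made above.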
Another measurement that is relevant to discussions of memory is bandwidth. Bandwidth measures how much data can be returned
from memory per second. For example, imagine that in one second a virtual CPU
issues 10 million load instructions and each request misses cache. Each cache
miss will fetch a 64-byte cache line from memory so that a single virtual CPU
has consumed a bandwidth of 640MB in a second.
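The bandwidth arithmetic in this example is straightforward to check:

```python
# Worked example from the text: 10 million load instructions per second,
# every one missing cache, each miss fetching a 64-byte cache line.
loads_per_second = 10_000_000
cache_line_bytes = 64

consumed_bandwidth = loads_per_second * cache_line_bytes  # bytes per second
print(consumed_bandwidth)  # 640000000, i.e. 640 MB/s
```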
A chip with many hardware threads can make large demands on memory bandwidth since, at any one time, each
thread could possibly have one or more outstanding memory requests. Suppose
that there are 64 threads on a processor, the memory latency is 100 cycles, and
the processor is clocked at a modest rate of 1GHz. If each thread is constantly
issuing requests for new cache lines from memory, then each thread will issue
one such request every 100 cycles (100 cycles being the time it takes for the
previous request to complete). This makes
/ 100 ∗ 64 = 640 million memory requests per second. If
each request is for a fresh 64-byte cache line, then this represents an
aggregate bandwidth of approximately 41GB/s.
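The same calculation, written out step by step, confirms the figures in the example:

```python
# Reproducing the worked example: 64 threads, 100-cycle memory latency,
# a 1 GHz clock, and 64-byte cache lines.
clock_hz = 1_000_000_000   # 1 GHz = 10^9 cycles per second
latency_cycles = 100
threads = 64
cache_line_bytes = 64

# Each thread issues one request per 100 cycles; 64 threads in aggregate:
requests_per_second = clock_hz // latency_cycles * threads
print(requests_per_second)  # 640000000, i.e. 640 million requests/s

# Each request fetches a fresh 64-byte line:
bandwidth_bytes = requests_per_second * cache_line_bytes
print(bandwidth_bytes / 1e9)  # 40.96, approximately 41 GB/s
```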