Multiple Threads on a Single Chip
of a processor is the part of the chip responsible for executing instructions.
The core has many parts, and we will discuss some of those parts in detail
later in this chapter. A simplified schematic of a processor might look like Figure
an area of memory on the chip that holds recently used data and instructions.
When you look at the piece of silicon inside a processor, such as that shown in
Figure 1.7, the core and the cache are the two components that are identifiable
to the eye. We will discuss cache in the “Caches” section later in this
simplest way of enabling a chip to run multiple threads is to duplicate the
core multiple times, as shown in Figure 1.4. The earliest processors capable of
supporting mul-tiple threads relied on this approach. This is the fundamental
idea of multicore proces-sors. It is an easy approach because it takes an
existing processor design and replicates it. There are some complications
involved in making the two cores communicate with each other and with the
system, but the changes to the core (which is the most complex part of the
processor) are minimal. The two cores share an interface to the rest of the
system, which means that system access must be shared between the two cores.
this is not the only approach. An alternative is to make a single core execute
multiple threads of instructions, as shown in Figure 1.5. There are various
refinements on this design:
could execute instructions from one software thread for 100 cycles and then
switch to another thread for the next 100.
could alternate every cycle between fetching an instruction from one thread and
fetching an instruction from the other thread.
could simultaneously fetch an instruction from each of multiple threads every
could switch software threads every time the stream that is currently executing
hits a long latency event (such as a cache miss, where the data has to be fetched
threads sharing a core, each thread will get a share of the resources. The size
of the share will depend on the activity of the other thread and the number of
resources available. For example, if one thread is stalled waiting on memory,
then the other thread may have exclusive access to all the resources of the
core. However, if both threads want to simultaneously issue the same type of
instruction, then for some processors only one thread will be successful, and
the other thread will have to retry on the next opportunity.
multicore processors use a combination of multiple cores and multiple threads
per core. The simplest example of this would be a processor with two cores with
each core being capable of supporting two threads, making a total of four
threads for the entire processor. Figure 1.6 shows this configuration.
ability to handle multiple threads is exposed to the operating system, it
usually appears that the system has many virtual CPUs. Therefore, from the
perspective of the user, the system is capable of running multiple threads. One
term used to describe this is chip
multithreading (CMT)—one chip, many threads. This term places the emphasis
on the fact that there are many threads, without stressing about the
implementation details of how threads are assigned to cores.
UltraSPARC T2 is a good example of a CMT processor. It has eight replicated
cores, and each core is capable of running eight threads, making the processor
capable of running 64 software threads simultaneously. Figure 1.7 shows the
physical layout of the processor.
UltraSPARC T2 floor plan has a number of different areas that offer support
functionality to the cores of the processor; these are mainly located around
the outside edge of the chip. The eight processor cores are readily
identifiable because of their struc-tural similarity. For example, SPARC Core 2
is the vertical reflection of SPARC Core 0, which is the horizontal reflection
of SPARC Core 4. The other obvious structure is the crosshatch pattern that is
caused by the regular structure elements that form the second-level cache area;
this is an area of on-chip memory that is shared between all the cores. This
memory holds recently used data and makes it less likely that data will have to
be fetched from memory; it also enables data to be quickly shared between
important to realize that the implementation details of CMT processors do have
detectable effects, particularly when multiple threads are distributed over the
system. But the hardware threads can usually be considered as all being equal.
In current processor designs, there are not fast hardware threads and slow
hardware threads; the performance of a thread depends on what else is currently
executing on the system, not on some invariant property of the design.
example, suppose the CPU in a system has two cores, and each core can support
two threads. When two threads are running on that system, either they can be on
the same core or they can be on different cores. It is probable that when the
threads share a core, they run slower than if they were scheduled on different
cores. This is an obvious result of having to share resources in one instance
and not having to share resources in the other.
operating systems are evolving to include concepts of locality of memory and
sharing of processor resources so that they can automatically assign work in
the best possible way. An example of this is the locality group information used by the Solaris oper-ating system to
schedule work to processors. This information tells the operating system which
virtual processors share resources. Best performance will probably be attained
by scheduling work to virtual processors that do not share resources.
situation where it is useful for the operating system to understand the
topology of the system is when a thread wakes up and is unable to be scheduled
to exactly the same virtual CPU that was running it earlier. Then the thread
can be sched-uled to a virtual CPU that shares the same locality group. This is
less of a disturbance than running it on a virtual processor that shares
nothing with the original virtual processor. For example, Linux has the concept
of affinity, which keeps threads
local to where they were previously executing.
of topological information becomes even more important in systems where there
are multiple processors, with each processor capable of supporting multiple
threads. The difference in performance between scheduling a thread on any of
the cores of a sin-gle processor may be slight, but the difference in
performance when a thread is migrated to a different processor can be
significant, particularly if the data it is using is held in memory that is
local to the original processor. Memory affinity will be discussed further in
the section “The Characteristics of Multiprocessor Systems.”
following sections, we will discuss the components of the processor core. A
rough schematic of the critical parts of a processor core might look like
Figure 1.8. This
specialized pipelines for each instruction type, the on-chip memory (called cache), the translation look-aside
buffers (TLBs) that are used for converting virtual mem-ory addresses to
physical, and the system interconnect (which is the layer that is responsi-ble
for communicating with the rest of the system).
section, “Increasing Instruction Issue Rate with Pipelined Processor Cores,”
explains the motivation for the various “pipelines” that are found in the cores
of modern processors. Sections “Using Caches to Hold Recently Used Data,”
“Using Virtual Memory to Store Data,” and “Translating from Virtual Addresses
to Physical Addresses” in this chapter cover the purpose and functionality of
the caches and TLBs.