Processors and Scaling
The two big advantages of multicore processors are their ability to run
multiple threads and the low synchronization costs between those threads.
Synchronization costs govern scaling in two important ways. First, low
synchronization costs mean that the code will scale more effectively to higher
thread counts. Second, similar reasoning leads to the enticing possibility that
low synchronization costs enable developers to produce parallel versions of
routines that were previously too small to parallelize. Consequently, there is
a fortuitous convergence: the processors whose low synchronization costs enable
scaling to high thread counts are also the ones that provide the threads to
perform that scaling.
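
The effect of synchronization cost on scaling can be illustrated with a small
model. This is only a sketch under stated assumptions: the cost formula (work
divided evenly across threads, plus one synchronization charge per thread) and
the names parallel_runtime and best_thread_count are hypothetical, and the
units are arbitrary rather than measured.

```python
def parallel_runtime(work, threads, sync_cost):
    """Estimated runtime in arbitrary units.

    Assumed model: the work divides evenly across the threads, and each
    thread adds one synchronization charge. A single thread pays nothing.
    """
    if threads == 1:
        return float(work)
    return work / threads + sync_cost * threads


def best_thread_count(work, sync_cost, max_threads):
    """Thread count in 1..max_threads with the lowest estimated runtime."""
    return min(range(1, max_threads + 1),
               key=lambda n: parallel_runtime(work, n, sync_cost))


if __name__ == "__main__":
    for work in (1000, 50):            # a large routine and a small one
        for sync_cost in (100, 1):     # expensive and cheap synchronization
            n = best_thread_count(work, sync_cost, max_threads=64)
            print(f"work={work} sync_cost={sync_cost} best threads={n}")
```

Under this model, the small routine is only worth parallelizing when the
synchronization cost is low, and the large routine benefits from many more
threads in the low-cost case.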

Low synchronization costs lead to one further advantage for multicore
processors. This section has discussed a number of reasons that scaling could
be limited, and several of these are implicitly functions of the communication
costs between cores. The most obvious example of this is false sharing.

The cost of false sharing is that each update to a shared cache line incurs the
cache-to-cache communication latency between the cores where the threads are
running. Between separate processors, this cost is typically of the order of
memory latency. On a multicore processor, the communication cost between two
threads is the latency of the closest level of cache shared by the two cores.
Since this cache is usually on chip, the latency is often an order of magnitude
less than memory latency. If we take the code demonstrating false sharing from
Listing 9.20 and run it on a multicore system, the increase in runtime due to
false sharing is minimal.
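
The layout arithmetic behind false sharing can be sketched as follows. This is
not the code from Listing 9.20. It is a hypothetical illustration that assumes
a 64-byte cache line, 8-byte per-thread counters, and an array that starts on
a cache-line boundary. The names cache_lines and shares_a_line are invented
for the example.

```python
CACHE_LINE_BYTES = 64  # assumed line size; check the target processor


def cache_lines(num_counters, stride_bytes, line_bytes=CACHE_LINE_BYTES):
    """Cache-line index holding each counter.

    Counters are placed stride_bytes apart, starting at an address that is
    assumed to be aligned to a cache-line boundary.
    """
    return [(i * stride_bytes) // line_bytes for i in range(num_counters)]


def shares_a_line(lines):
    """True if at least two counters fall on the same cache line."""
    return len(set(lines)) < len(lines)


if __name__ == "__main__":
    packed = cache_lines(4, stride_bytes=8)    # counters back to back
    padded = cache_lines(4, stride_bytes=64)   # one counter per line
    print("packed:", packed, "false sharing possible:", shares_a_line(packed))
    print("padded:", padded, "false sharing possible:", shares_a_line(padded))
```

With packed counters, every thread updates the same line, so each update pays
the cache-to-cache latency described above. Padding trades memory for
independent lines.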

However, these benefits of multicore processors are likely to disappear if the
system contains multiple processors. In that case, it becomes more important to
consider the memory locality of the data that a thread is using. The migration
of a thread between processors will cause local data to become remote. If data
is shared between threads, some of those threads might see local access costs,
and some might see remote access costs. For optimal performance, it may be
appropriate to consider binding threads to virtual CPUs.
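
One way to bind to a virtual CPU is sketched below, assuming a Linux system
where Python's os.sched_setaffinity is available. The helper names
allowed_cpus and bind_to_cpu are hypothetical. With a pid of 0 the call
applies to the caller, which on Linux means the calling thread. Other
platforms provide different interfaces, so the sketch reports failure rather
than guessing.

```python
import os


def allowed_cpus():
    """Virtual CPUs the caller may currently run on (empty if unknown)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return []


def bind_to_cpu(cpu):
    """Restrict the caller to one virtual CPU. Returns True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # interface not provided on this platform
    try:
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return False  # CPU not permitted, or the call was refused
    return True


if __name__ == "__main__":
    cpus = allowed_cpus()
    if cpus:
        previous = os.sched_getaffinity(0)
        print("bound:", bind_to_cpu(cpus[0]))
        os.sched_setaffinity(0, previous)  # restore the original mask
    else:
        print("CPU affinity is not available here")
```

Binding prevents the migration between processors that turns local data into
remote data, at the cost of removing the scheduler's freedom to move the
thread when that CPU is busy.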

With a multicore, multiprocessor system, there is the question of whether it is
better to spread the work across all the chips or to constrain it to a single
chip. Using multiple chips may provide more instruction issue bandwidth and
more memory bandwidth, but the cost will be increased communication latency
between threads. Constraining all the threads to reside on a single chip will
provide the best communication latency but may not provide optimal instruction
issue bandwidth and memory bandwidth.
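
The trade-off can be made concrete with a rough cost model. Everything here is
an assumption for illustration: the function estimated_time, its parameters,
the additive form of the estimate, and the default values are hypothetical.
The only figure taken from the discussion above is that on-chip communication
is taken to be about ten times cheaper than communication between chips.

```python
def estimated_time(chips, compute_ops, memory_bytes, sync_events,
                   ops_per_chip=1.0, bytes_per_chip=1.0,
                   on_chip_latency=1.0, cross_chip_latency=10.0):
    """Rough runtime estimate in arbitrary units (assumed additive model).

    Issue bandwidth and memory bandwidth grow with the number of chips used.
    Every synchronization event pays on-chip latency when all threads share
    one chip, and cross-chip latency otherwise.
    """
    latency = on_chip_latency if chips == 1 else cross_chip_latency
    return (compute_ops / (chips * ops_per_chip)
            + memory_bytes / (chips * bytes_per_chip)
            + sync_events * latency)


if __name__ == "__main__":
    workloads = {
        "bandwidth-bound": dict(compute_ops=1000, memory_bytes=1000,
                                sync_events=10),
        "communication-bound": dict(compute_ops=100, memory_bytes=100,
                                    sync_events=100),
    }
    for name, w in workloads.items():
        one = estimated_time(1, **w)
        two = estimated_time(2, **w)
        choice = "spread across chips" if two < one else "stay on one chip"
        print(f"{name}: one chip={one}, two chips={two} -> {choice}")
```

In this model, a workload dominated by computation and memory traffic favors
spreading across chips, while one dominated by communication between threads
favors a single chip. Real systems need measurement rather than a model.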

Although multicore processors present great opportunities for running parallel
workloads, they also have constraints. Most critical is the sharing of
resources between the virtual CPUs. These resources might be processor
bandwidth, instruction pipeline, or cache. These constraints will have an
impact on the scaling of a single process as the number of threads increases.
Scaling to low numbers of threads will often be close to linear, but scaling to
higher numbers of threads may demonstrate limitations of both the hardware and
the software.

Even within a single multicore chip, it may be worth considering binding
threads to virtual processors. The optimal assignment of work will probably be
achieved by placing as few threads as possible on each core. This is a task
that operating systems should perform automatically, but there may be
situations where this does not happen. One example might be when there are
multiple processes active on the machine, making optimal placement hard for the
OS. The OS may decide to place the threads where there is spare compute
resource rather than placing threads optimally for the process.
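
A placement that puts as few threads as possible on each core can be sketched
as below. The function spread_over_cores and the core_of_cpu mapping (virtual
CPU id to core id) are hypothetical. A real program would have to obtain the
mapping from the operating system and then bind each thread to its chosen
virtual CPU.

```python
from collections import defaultdict


def spread_over_cores(core_of_cpu, num_threads):
    """Choose virtual CPUs so that every core is used once before any core
    is used twice.

    core_of_cpu maps a virtual CPU id to the id of the core that hosts it.
    Returns at most num_threads virtual CPU ids, in assignment order.
    """
    by_core = defaultdict(list)
    for cpu in sorted(core_of_cpu):
        by_core[core_of_cpu[cpu]].append(cpu)

    chosen = []
    depth = 0  # how many virtual CPUs are already taken on each core
    while len(chosen) < num_threads:
        added = False
        for core in sorted(by_core):
            if depth < len(by_core[core]) and len(chosen) < num_threads:
                chosen.append(by_core[core][depth])
                added = True
        if not added:
            break  # more threads requested than virtual CPUs exist
        depth += 1
    return chosen


if __name__ == "__main__":
    # Hypothetical chip: two cores, each with two virtual CPUs.
    topology = {0: 0, 1: 0, 2: 1, 3: 1}
    for threads in (2, 3, 4):
        print(threads, "threads ->", spread_over_cores(topology, threads))
```

With two threads on this hypothetical chip, each thread gets a core to itself.
Only the third and fourth threads have to share a core with another thread.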