Multicore Processors and Scaling
The two big advantages of multicore processors are their ability to run multiple threads and the low synchronization costs between those threads. Synchronization costs govern scaling in two important ways. First, low synchronization costs mean that code will scale more effectively to higher thread counts. Second, similar reasoning leads to the enticing possibility that low synchronization costs enable developers to produce parallel versions of routines that were previously too small to be worth parallelizing. There is a fortuitous convergence here: the processors whose low synchronization costs enable scaling to high thread counts also provide the threads that perform that scaling.
The low synchronization costs lead to one further advantage for multicore processors. This section has discussed a number of reasons that scaling could be limited, and a number of these are implicitly functions of the communication costs between cores. The most obvious example of this is false sharing.
The cost of false sharing arises because every update to the shared cache line incurs the cache-to-cache communication latency between the cores where the threads are running. When the threads run on separate chips, this cost is typically of the order of memory latency. On a multicore processor, the communication cost between two threads is the latency of the closest level of cache shared by the two cores. Since this cache is usually on chip, the latency is often an order of magnitude less than memory latency. If we take the code demonstrating false sharing from Listing 9.20 and run it on a multicore system, the increase in runtime due to false sharing is minimal.
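As a rough illustration of the effect described above, the following sketch lets each thread increment its own counter, either packed next to its neighbors or padded out to a full cache line. This is not Listing 9.20. The names run_counters, NTHREADS, and CACHE_LINE are hypothetical, and the 64-byte line size is an assumption that should be checked for the target processor.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Adjacent counters: several fit in one cache line, so the threads falsely share it. */
static volatile long packed[NTHREADS];

/* One counter per cache line: the padding keeps each thread's line private. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};
static _Alignas(CACHE_LINE) struct padded_counter padded[NTHREADS];

struct work {
    volatile long *counter;   /* the one counter this thread updates */
    long iterations;
};

static void *increment(void *arg) {
    struct work *w = arg;
    for (long i = 0; i < w->iterations; i++) {
        (*w->counter)++;      /* each update touches the thread's cache line */
    }
    return NULL;
}

/* Runs NTHREADS threads, each incrementing its own counter.
   use_padding selects the padded layout (nonzero) or the packed one (zero).
   Returns the sum of all counters so the caller can check the result. */
long run_counters(int use_padding, long iterations) {
    pthread_t threads[NTHREADS];
    struct work work[NTHREADS];
    long total = 0;

    for (int i = 0; i < NTHREADS; i++) {
        work[i].counter = use_padding ? &padded[i].value : &packed[i];
        *work[i].counter = 0;
        work[i].iterations = iterations;
        pthread_create(&threads[i], NULL, increment, &work[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
        total += *work[i].counter;
    }
    return total;
}
```

Timing run_counters(0, n) against run_counters(1, n) for a large n gives a feel for the penalty. Following the reasoning above, the gap should be small when all threads share an on-chip cache and larger when they run on separate chips, although the exact figures depend on the hardware.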
However, these benefits from multicore processors are likely to disappear if the system is a multiprocessor system. If the system has multiple processors, then it becomes more important to consider the memory locality of the data that a thread is using. The migration of a thread between processors will cause data that was local to become remote. If data is shared between threads, some of those threads might see local access costs, and some might see remote access costs. For optimal performance, it may be appropriate to consider binding threads to virtual CPUs.
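One way to bind a thread to a virtual CPU is sketched below. It assumes Linux with glibc, where pthread_setaffinity_np and pthread_getaffinity_np are available as nonportable extensions; other operating systems provide different calls. The helper names first_allowed_cpu and bind_self_to_cpu are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t and the *_np affinity calls in glibc */
#endif
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Returns the lowest-numbered virtual CPU the calling thread is allowed
   to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

/* Binds the calling thread to a single virtual CPU.
   Returns 0 on success or an errno-style error code on failure. */
int bind_self_to_cpu(int cpu) {
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return EINVAL;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

A thread would typically call bind_self_to_cpu early, before it allocates and first touches its data, so that on systems that place memory near the thread that first touches it, the data stays local to the processor the thread is bound to.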
With a multicore, multiprocessor system, there is the question of whether it is better to spread the work across all the chips or whether it is better to constrain it to within a single chip. Using multiple chips may provide more instruction issue bandwidth, and it may also provide more memory bandwidth, but the cost will be increased communication latency between threads. Constraining all the threads to reside on a single chip will provide the best communication latency but may not provide optimal instruction issue width and memory bandwidth.
Although multicore processors present great opportunities for running parallel workloads, they also have constraints. Most critical is the sharing of resources between the virtual CPUs. These resources might be processor bandwidth, instruction pipeline, or cache. These constraints will have an impact on the scaling of a single process as the number of threads increases. Scaling to low numbers of threads will often be close to linear, but scaling to higher numbers of threads may demonstrate limitations of both the hardware and software.
Even within a single multicore chip, it may be worth considering binding threads to virtual processors. The optimal assignment of work will probably be achieved by placing as few threads as possible on each core. This is a task that operating systems should perform automatically, but there may be situations where this does not happen. One example might be when there are multiple processes active on the machine, making it hard for the OS to place every thread well. The OS may then place threads wherever there is spare compute resource rather than where placement is optimal for the process.
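The policy of placing as few threads as possible on each core can be sketched as a simple mapping from thread index to virtual CPU. The function name placement_cpu is hypothetical, and the sketch assumes that virtual CPUs are numbered as core * threads_per_core + strand. Real systems may number their virtual CPUs differently, so the topology should be checked before relying on this.

```c
/* Chooses a virtual CPU for a software thread so that threads are spread
   across cores first, and only then doubled up on a core.

   Assumed (hypothetical) numbering: virtual CPU id =
   core * threads_per_core + strand, where strand is the hardware
   thread within the core.

   Returns the virtual CPU id, or -1 for invalid arguments. */
int placement_cpu(int thread_index, int cores, int threads_per_core) {
    if (cores <= 0 || threads_per_core <= 0 || thread_index < 0) {
        return -1;
    }
    int total = cores * threads_per_core;
    int slot = thread_index % total;   /* wrap around if oversubscribed */
    int core = slot % cores;           /* round-robin across cores first */
    int strand = slot / cores;         /* then the next strand on each core */
    return core * threads_per_core + strand;
}
```

With four cores and two hardware threads per core, the first four software threads each get a core to themselves, and only the fifth thread starts sharing a core with an earlier one. The returned id could then be passed to whatever binding call the operating system provides.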