Commonly Available Profiling Tools
Most modern profiling tools do not require you to do anything special to the application. However, it is often beneficial to build the application with debug information. The debug information enables the tools to aggregate runtime at the level of individual lines of source code. There are two common approaches to profiling.
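As a minimal sketch of that build step (the file name, function, and exact flags below are illustrative assumptions rather than anything taken from the text), a program might be compiled with debug information, and with optimization left on, so that a profiler can attribute time to individual source lines:

/* hotspot.c -- a hypothetical program used to illustrate line-level profiling.
 *
 * Building with debug information typically means adding -g while keeping
 * optimization enabled so the profile reflects the code you will ship, e.g.:
 *     cc -g -O2 hotspot.c -o hotspot
 * The exact flags depend on your compiler and profiling tool.
 */
#include <stdio.h>

#define N 100000000

int main(void)
{
    double total = 0.0;

    /* With debug information available, a profiler can attribute the time
     * spent in this loop to these specific source lines rather than only
     * to main() as a whole. */
    for (long i = 1; i <= N; i++)
        total += 1.0 / ((double)i * (double)i);

    printf("total = %f\n", total);
    return 0;
}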
The first approach is system-wide profiling. This is the approach taken by tools such as Intel's VTune, AMD's CodeAnalyst, and the open source profiling tool oprofile. The entire system is inspected, and timing information is gathered for all the processes running on the system. This is a very useful approach when there are a number of coordinating applications running on the system. During the analysis of the data, it is normal to focus on a single application.
Figure 2.9 shows the output from AMD CodeAnalyst listing all the active processes on the system.
In this instance, there are two applications using up almost all of the CPU resources between them.
The second common approach is to profile just the application of interest. This approach is exemplified by the Solaris Studio Performance Analyzer. Profiling a single application enables the user to focus entirely on that application and not be distracted by the other activity on the system.
Regardless of the tool used, there is a common set of necessary and useful features. The most critical feature is probably the profile of the time spent in each function. Figure 2.10 shows the time spent in each function as reported by the Solaris Studio Performance Analyzer.
The profile for this code shows that about 70% of the user time is spent in the routine calc(), with the remainder spent in the routine __write().
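As a hedged illustration of the kind of program that produces this profile shape (the body of calc() below is an assumption, not the code measured in the figure), most of the user time goes into a compute routine and the rest into writing out the results:

#include <stdio.h>

/* Hypothetical compute routine; in a profile of this program, most of the
 * user time would be attributed to this function. */
static double calc(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)i * 0.5;
    return sum;
}

int main(void)
{
    /* The remaining time is spent writing output; printf() ultimately calls
     * a low-level write routine such as __write() in the C library. */
    for (int iter = 0; iter < 1000; iter++)
        printf("%f\n", calc(1000000));
    return 0;
}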
Profile data at the function level can be useful for confirming that time is being spent in the expected routines. However, more detail is usually necessary in order to improve the application. Figure 2.11 shows time attributed to lines of source in AMD CodeAnalyst.
Using the source-level profile, most developers can make decisions about how to restructure their code to improve performance. It can also be reassuring to drop down to the assembly code level to examine the quality of the code produced by the compiler and to identify the particular operations that are taking up the time. At the assembly code level, it is possible to identify problems such as pointer aliasing producing suboptimal code, memory operations taking excessive amounts of time, or other instructions that are contributing significant time. Figure 2.12 shows the disassembly view from AMD CodeAnalyst.
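As a sketch of the pointer-aliasing problem (hypothetical code, not taken from the text): in the first routine below the compiler must assume that the input and output arrays might overlap, which limits how aggressively it can schedule or vectorize the loop; the second routine uses C99's restrict qualifier to promise that they do not, which typically results in tighter assembly code.

/* Without restrict, the compiler must assume that out and in may overlap,
 * so it cannot safely keep values in registers or vectorize the loop. */
void scale(double *out, const double *in, double factor, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* With restrict, the programmer promises that out and in do not alias,
 * which usually allows the compiler to generate better code. */
void scale_restrict(double *restrict out, const double *restrict in,
                    double factor, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}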
Some tools are able to provide suggestions on how to improve the performance of the application. Figure 2.13 shows output from Apple's Shark tool, which suggests improving performance by recompiling the application to use SSE instructions.
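As a sketch of the kind of code such a suggestion targets (the loop and the flags mentioned here are illustrative assumptions), a simple floating-point loop like the one below can often be compiled to SSE instructions by enabling the appropriate options, for example -O3 together with an instruction-set flag such as -msse2 when gcc targets 32-bit x86, where SSE is not enabled by default:

/* A loop like this is a good candidate for SSE: the same operation is
 * applied to many consecutive single-precision values, so the compiler
 * can process several elements per instruction when SSE is enabled. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}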
Many performance problems can be analyzed and solved at the level of lines of source code. However, in some instances, the problem is related to how the routine is used. In this situation, it becomes important to see the call stack for a routine. Figure 2.14 shows a call chart from Intel's VTune tool. The figure shows two threads in the application and indicates the caller-callee relationship between the functions called by the two threads.
An alternative way of presenting caller-callee data comes from the Oracle Solaris Studio Performance Analyzer, as shown in Figure 2.15. This hierarchical view allows the user to drill down into the hottest regions of code.
Another view of the data that can be particularly useful is the time line view. This shows program activity over time. Figure 2.16 shows the time line view from the Solaris Studio Performance Analyzer. In this case, the time line view shows both thread activity, which corresponds to the shaded regions of the horizontal bars, and call stack information, indicated by the different colors used to shade the bars. Examining the run of an application over time can highlight issues when the behavior of an application changes during the run. An example of this might be an application that develops a great demand for memory at some point in its execution and consequently spends a period of its runtime exclusively in memory allocation code.
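A hedged sketch of that kind of phase behavior (hypothetical code, not from the text): in a time line view, the first loop below would show up as ordinary compute activity, while the second phase would show nearly all of its time inside the memory allocator.

#include <stdlib.h>

#define ITERATIONS  50000000
#define ALLOCATIONS  5000000

int main(void)
{
    volatile double total = 0.0;

    /* Phase 1: pure computation. */
    for (long i = 1; i <= ITERATIONS; i++)
        total += 1.0 / (double)i;

    /* Phase 2: a sudden demand for memory. A time line view would show
     * this period dominated by malloc()/free() rather than computation. */
    for (long i = 0; i < ALLOCATIONS; i++) {
        char *p = malloc(64);
        if (p != NULL) {
            p[0] = (char)i;
            free(p);
        }
    }

    return 0;
}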
A similar view is available from Apple's Instruments tool, as shown in Figure 2.17. An instrument is the name given to a tool that gathers data about processor, disk, network, and memory usage over the run of an application. This particular example shows processor utilization by the two threads over the run of the application. The time line view is particularly useful for multithreaded applications. To get the best performance, the work needs to be evenly divided between all threads. The time line view is a quick way of telling whether some threads are more active than others. It can also be useful for codes where a synchronization event, such as garbage collection in the case of Java, causes most of the threads in an application to pause.
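A minimal POSIX threads sketch (hypothetical, not the book's example) of the kind of imbalance a time line view makes obvious: one thread is given far more work than the other, so its bar stays busy long after the other thread has gone idle. Building it typically requires linking against the pthread library (for example, -lpthread).

#include <pthread.h>
#include <stdio.h>

/* Each thread spins through a different amount of work; a time line view
 * would show the first thread finishing long before the second. */
static void *worker(void *arg)
{
    long iterations = *(long *)arg;
    volatile double sum = 0.0;
    for (long i = 1; i <= iterations; i++)
        sum += 1.0 / (double)i;
    return NULL;
}

int main(void)
{
    pthread_t threads[2];
    long work[2] = { 100000000L, 400000000L };  /* uneven split of the work */

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &work[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);

    printf("done\n");
    return 0;
}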
Performance analysis tools are critical in producing optimal serial and parallel codes. Consequently, it is important to become familiar with the tools available on your system. For serial codes, a performance analysis tool will identify the region of code that needs to be improved to increase the performance of the application. For parallel codes, it will allow you to identify regions of code where the parallelization could be improved or where the work could be better distributed across the available processors or threads.