Performance analysis to optimize HPC applications is challenging at many levels, not the least of which is the availability of adequate performance analysis and measurement tools. Performance tools are underappreciated at best, and most organizations rely on the vendor-supplied tools included as part of a machine procurement. While generally good for analysis on a single node, such tools typically do not provide the capabilities needed to analyze heterogeneous systems containing accelerators and/or distributed applications running across large numbers of nodes. As a result, most programmers are stuck guessing at performance issues. The patchwork nature and lack of consistency among the performance tools available at various HPC centers also means that many programmers never become proficient with the tool(s) provided at a new site or installed on a new machine.
There are a number of excellent commercial performance analysis tools on the market, but their big drawback is that they cost money. As a result, the acquisition of commercial performance analysis software falls through the cracks, as most funding agencies discourage or prohibit the use of grant money for infrastructure improvements like software tools, and few grant authors are willing to take money away from research to purchase software.
Open-source tools such as Open|SpeedShop (also written OpenSpeedShop) can fill this gap and meet the needs of the HPC community by providing a free, commonly available, comprehensive tool for performance analysis on a single node or at scale on most HPC hardware platforms in use today. Free means that anyone can download the software and become adept at using it. Support for most HPC hardware platforms means that tools like OpenSpeedShop help programmers become self-taught, portable optimization experts whose skills carry over to most HPC organizations.
Community acceptance and direct funding from DOE’s NNSA and the Office of Science mean that OpenSpeedShop has access to a wide variety of hardware platforms and some of the largest leadership-class supercomputers in the world. As a result, the OpenSpeedShop developers can validate the tool and wring out issues with performance profiling on many different hardware platforms and at extreme scale. It is important to realize that collecting performance data — especially without affecting application performance — is a delicate problem that can conceivably generate more data than the ‘big data’ problem the scientists are solving. Consider, for example, the extraordinary amount of data that can be collected from tens of thousands of nodes on a large petascale computer when running a job that takes minutes, hours or days. Once collected, the profiling data must then be made available in a reasonably interactive fashion so the programmer can peek, poke and prod to find the pain spots in the runtime behavior.
Performance profiling is about more than just measuring time: programmers need to find the critical sections of code and determine whether they are running efficiently, so they can decide where to spend their effort for the greatest optimization payoff.
It sounds easy to just ‘find’ the ‘critical’ sections of code and start optimizing. The devil is in the details, which is what makes code optimization such a complex, multidimensional problem: runtimes, along with the locations of the critical sections, can vary wildly depending on the input data, the size of the problem, the size of the machine, and the components within the machine.
The typical workflow pattern for optimization consists of:
- measure performance for a representative problem, then
- analyze the performance data,
- make changes to the code and algorithms, and
- repeat the measurements and compare the results (a minimal timing sketch follows this list).
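To make the measure-and-compare steps concrete, here is a minimal timing sketch in C. It is only an illustration, not how OpenSpeedShop works: the kernel function is a hypothetical stand-in for whatever routine is being optimized, and wall-clock time is the simplest possible metric.

```c
/* Minimal sketch of the "measure" step: wall-clock a section of interest so
 * before/after runs on the same representative input can be compared. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical kernel standing in for the routine being optimized. */
static double compute_kernel(long n)
{
    double sum = 0.0;
    for (long i = 1; i <= n; ++i)
        sum += 1.0 / (double)i;
    return sum;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double result = compute_kernel(100000000L);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("result=%f elapsed=%.3f s\n", result, elapsed);
    /* Record 'elapsed', change the code, rerun on the same input,
     * and compare the two measurements. */
    return 0;
}
```

Running this harness before and after a code change gives the numbers the last workflow step compares; a real profiler adds the per-function and per-line breakdown that raw timing cannot provide.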
Most profiling tools tout their capability as an all-in-one suite of comprehensive tools — and OpenSpeedShop is no exception. The question to answer from a user’s perspective is what, exactly, this ‘comprehensive set of tools’ does. In particular, it is important that the software help the programmer find computational bottlenecks on multicore processors as well as on accelerators such as GPUs and Intel Xeon Phi coprocessors, and give the programmer the ability to analyze the performance of the I/O subsystems that handle both storage and network communications, along with the memory subsystem that supplies data to the processors. Otherwise, programmers can fall prey to the adage that a supercomputer is an expensive device that turns a compute-bound problem into an I/O-bound problem.
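As a crude illustration of why I/O and memory visibility matters, the sketch below uses the POSIX getrusage call to report CPU time, peak memory footprint, and block I/O counts after a run. This is only the coarsest possible view, and it assumes Linux semantics for the fields (ru_maxrss reported in KiB); tools such as OpenSpeedShop collect far richer hardware-counter, I/O, and memory data.

```c
/* Crude sketch: after a run, query basic OS accounting to see whether the
 * process spent its time computing or waiting on storage (Linux getrusage). */
#include <stdio.h>
#include <sys/resource.h>

static void report_usage(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return;
    }
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("user CPU: %.2f s, system CPU: %.2f s\n", user, sys);
    printf("peak resident set: %ld KiB\n", ru.ru_maxrss);   /* memory footprint (Linux) */
    printf("block I/O: %ld reads, %ld writes\n",
           ru.ru_inblock, ru.ru_oublock);                    /* filesystem activity */
}

int main(void)
{
    /* ... application work would run here ... */
    report_usage();
    return 0;
}
```

If system CPU time and block I/O counts dominate the user CPU time, the run is likely I/O bound rather than compute bound, which is exactly the kind of first-order question a profiler should answer quickly.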
It is assumed that the profiler is able to collect data without significantly impacting the runtime of the application. In most cases, profiling is performed on an unmodified application binary, so there is no question of instrumented code affecting compiler optimizations. In essence, the profiler is attempting to observe without interfering.
When running, the application binary is sampled either statistically, where execution is periodically interrupted and the location of the program counter (or counters, in the case of a parallel program) is recorded, or via event tracing, where individual application events are recorded for all running application instances. Event tracing is wonderful because it provides very detailed information, but it can lead to huge volumes of data, which is why most optimization efforts start with statistical sampling to determine the most heavily used regions of code and to determine whether the application is stalled waiting for data.
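To show what statistical sampling means at the lowest level, here is a small, self-contained sketch (Linux/x86-64 with glibc assumed) that interrupts a stand-in workload with SIGPROF every 10 ms of CPU time and records the interrupted program counter. It is a toy, not OpenSpeedShop’s collector, but the principle of “interrupt, record where we were, resume” is the same.

```c
/* Minimal statistical-sampling sketch (Linux/x86-64, glibc): interrupt
 * execution periodically with SIGPROF and record the program counter. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static uintptr_t samples[MAX_SAMPLES];        /* sampled program-counter values */
static volatile sig_atomic_t nsamples = 0;

/* SIGPROF handler: record where the program counter was when the timer fired. */
static void on_sigprof(int sig, siginfo_t *si, void *uc_)
{
    (void)sig; (void)si;
    ucontext_t *uc = (ucontext_t *)uc_;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP]; /* x86-64 only */
}

int main(void)
{
    /* Install the handler and ask for SIGPROF every 10 ms of CPU time. */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer = { {0, 10000}, {0, 10000} };  /* interval, initial value */
    setitimer(ITIMER_PROF, &timer, NULL);

    /* Stand-in workload: the hot loop the sampler should "find". */
    double sum = 0.0;
    for (long i = 1; i < 200000000L; ++i)
        sum += 1.0 / (double)i;

    struct itimerval stop = { {0, 0}, {0, 0} };           /* disarm the timer */
    setitimer(ITIMER_PROF, &stop, NULL);

    printf("collected %d samples (sum=%f)\n", (int)nsamples, sum);
    /* A real profiler maps each sampled address back to a function and line. */
    return 0;
}
```

Mapping the recorded addresses back to functions and source lines (for example with addr2line) is what turns the raw samples into the familiar “hot function” profile.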
Once the profile data is collected, it is up to the ingenuity of the individual analyzing the data to find any performance issues. The profiling tools provide interactive access to performance data, generally through a customizable Graphical User Interface (GUI) and via a database that provides fast access to pertinent information. Many profilers also perform basic analysis that attempts to highlight hotspots in the code or likely reasons for performance bottlenecks.
Once an issue is identified, developers make changes to the code and algorithms in an attempt to reduce or bypass the identified performance bottlenecks. The revised application is then re-profiled, and comparisons are made to see whether runtime and hardware utilization have actually improved.
OpenSpeedShop appears to provide all the basics that support the profile -> analyze -> change -> compare workflow. It is up to the user to determine whether the GUI and analysis tools are sufficient for their needs. Happily, the open-source nature of OpenSpeedShop means that motivated developers can add functionality if any is found to be missing, or contribute useful new features.
Is the open-source nature of OpenSpeedShop sufficient? The success and ubiquity of the GNU software tools and other open-source software development projects argue that the OpenSpeedShop project is on the right track. Significant investment by the Krell Institute, DOE, NNSA, the Office of Science and other organizations reinforces the importance of a consistent, generally available profiling tool for the HPC community and bolsters confidence in the OpenSpeedShop project.
It’s definitely worth downloading a copy of OpenSpeedShop to see if it meets your needs. The cost (free) is certainly right! More information, plus the download section for Open|SpeedShop, can be found at http://www.openspeedshop.org.
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at [email protected].