Massive Milestone toward Exascale Debugging Achieved
At the SC12 conference, Rogue Wave Software, a provider of cross-platform software development tools and embedded components for the next generation of HPC applications, announced that TotalView has achieved a significant debugging milestone during testing conducted as part of its strategic scalability initiative. During the testing, TotalView demonstrated its capability to debug a parallel job running on 786,432 processor cores.
The tests were conducted on Sequoia, the IBM Blue Gene/Q supercomputer at Lawrence Livermore National Laboratory (LLNL). Sequoia serves the National Nuclear Security Administration’s Advanced Simulation and Computing (ASC) program, a cornerstone of the effort to ensure the safety, security, and reliability of the nation’s nuclear deterrent without underground testing.
“We are actively working to increase the capabilities of our scientific codes to scale and take advantage of the phenomenal power of Sequoia. As part of this effort, we are looking for ways to get more on-node parallelism from existing codes and architecting our new codes to support the even more massive degrees of parallelism that we know will be needed in the future,” stated Scott Futral, LLNL group leader for Development Environment. “Rogue Wave’s dedication to pushing for ever-increasing scales with its TotalView debugger and the recent tests give us reason to be confident that TotalView will continue to be a critical development tool as we reach higher and higher scales with our own codes.”
Rogue Wave’s scalability initiative, a partnership with LLNL and LLNL’s Tri-Lab partners (Los Alamos National Laboratory and Sandia National Laboratories), takes a multi-architecture approach, targeting the Blue Gene/Q platform along with x86-based architectures such as the Cray XE. Extreme-scale testing allows TotalView engineers to identify bottlenecks and prioritize efforts in optimizing and tuning the debugging engine for scalability. During the most recent testing session, TotalView successfully scaled across 786,432 cores, with no indication that the debugger was hitting any barriers.
Rogue Wave conducted this test using a hybrid MPI + OpenMP code that implements a method for solving a system of linear equations. This application, which uses MPI for distributed-memory, multi-process parallelism and OpenMP for shared-memory, thread-based parallelism, was selected because it shares important characteristics with many applications used on extreme-scale systems such as Sequoia. Testing against workloads representative of large-scale systems is another key aspect of meeting scalability requirements.
Since there was no indication of any barrier being hit at the 786,432-core mark, the testing suggests that TotalView could have leveraged more of Sequoia’s 1.5 million cores if additional compute nodes had been available. To push TotalView’s scalability further, additional tests oversubscribed the machine by spinning up more than one thread per core.
About TotalView
TotalView is a highly scalable debugger that provides troubleshooting for a wide variety of applications, including serial, parallel, multi-threaded, multi-process, and remote applications. Designed for developer productivity, TotalView simplifies and shortens the process of developing, debugging, and optimizing complex code. It provides a unique combination of capabilities for pinpointing and fixing hard-to-reproduce bugs, memory leaks, and performance issues. TotalView raises the bar for debugging by supporting CUDA and OpenACC and by providing deterministic reverse debugging, which allows users to pause, rewind, and play back sessions to accurately identify and correct errors.
About Rogue Wave Software
Rogue Wave Software is the largest independent provider of cross-platform software development tools and embedded components for the next generation of HPC applications. Rogue Wave marries high performance computing with high productivity computing to enable developers to harness the power of parallel applications and multicore computing. Rogue Wave products reduce the complexity of prototyping, developing, debugging and optimizing multi-processor and data-intensive applications. Rogue Wave is a Battery Ventures portfolio company.