People Make Petaflop Computing Possible
The heart of high-performance computing technology still resides in the human component
I recently had the privilege of running one of the first applications to fully utilize the 62,976-core Texas Advanced Computing Center (TACC) Ranger supercomputer. Without a doubt, Ranger is a superb computational tool for scientists from around the world. This system delivered over one third of a petaflop (363 TF/s) of useful floating-point throughput on my application and exhibited near-linear scaling to 60,000 processing cores.
Having first access to the full machine of this leading-edge supercomputer emphasized, yet again, the importance of the administrative and consulting staff in making large supercomputers perform actual work. Modern supercomputers are essentially large collections of commodity hardware tied together via a very high-performance communications fabric. This is great from a cost and performance standpoint, because it is very difficult to beat the price/performance ratio of commodity hardware. The economics of mass production provide such a significant advantage that few of the current TOP500 supercomputers are built out of specialized hardware.
Building supercomputers out of commodity hardware effectively moves the complexity of the supercomputer into the software stack and communications fabric. For example, to achieve a substantial fraction of a petaflop in floating-point performance required Ranger to be built out of many thousands of separate computers — specifically 3,936 compute nodes in a Sun Blade 6048 Modular System with 15,744 Quad-Core AMD Opteron processors. The sheer number of computational units forces both software and hardware to operate at scales and performance levels never previously tested or installed in a production environment.
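As a quick sanity check, the node, socket, and core counts quoted above multiply out to exactly the 62,976 cores cited earlier (the variable names below are mine, chosen for illustration):

```python
# Ranger's core count, from the figures quoted in the article:
nodes = 3936            # Sun Blade 6048 compute nodes
sockets_per_node = 4    # 15,744 quad-core Opterons across 3,936 nodes
cores_per_socket = 4    # quad-core AMD Opteron

processors = nodes * sockets_per_node      # 15,744 processors
total_cores = processors * cores_per_socket  # 62,976 cores
```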
Not surprisingly, things break and, unfortunately, the failures manifest themselves in obscure and non-obvious ways. In effect, the quest for faster supercomputers requires the construction of ever bigger and more complex custom-built prototype systems, despite the use of commodity components as basic building blocks. Due to the sheer expense of each latest-greatest generation of TOP500-class computers, manufacturers simply cannot afford to build and test these behemoths before bidding on a contract.
Commodity parts mitigate risk to a great degree. Manufacturers and buyers know that the individual components work extremely well on their own. As a general rule, we know that commodity computers and servers just work. The supercomputer catch is getting all the different parts to work together in harmony as a whole.
Going through the procurement process is exciting and tends to be dominated by a “think of the possibilities!” mentality. The excitement does not die down even after a vendor is selected and the contracts are negotiated and signed. Suddenly, the supercomputer starts becoming real, as a complex migration and installation process, based on a tremendous amount of pre-planning and a few educated guesses, swings into motion. Finally, a magic day arrives when racks of equipment are installed, power is applied and the first programs run. Hurray, the supercomputer works!

Figure 2: Observed peak effective rate versus number of Ranger cores
Then some user runs a big problem and something that should obviously work fails. Whatever it is that breaks generally fails in some silent and uninformative manner. What then?
At this point, the quality of the supercomputer support staff, their relationships with the vendor(s) and their troubleshooting skills become the defining components that make the multi-million-dollar supercomputer investment work. Regardless of how fantastic the technology, it is the people on staff who make it happen.
Of all the complexity in my particular application, a simple one-line call that communicates some necessary startup information to all the Message Passing Interface (MPI) processes in the job turned out to be the most difficult stumbling block. “Small” runs using around 12,000 processing cores would work fine, while larger runs would fail. For me, this was an easy problem to identify: I only had to put print statements around one line in my code to show that a single MPI global broadcast was called by each process but never returned. Simple for me, I might add, because fixing the issue then became the problem of the “human” part of the supercomputer, in this case the TACC consulting and support staff and the manufacturer's support team.
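The article does not show the diagnostic code itself, but the print-statement bracketing it describes can be sketched in plain Python. The helper and its names here are my own invention, and `fake_bcast` is a stand-in for the real MPI broadcast:

```python
import sys

def bracket(rank, label, fn, *args):
    # Print (and flush) before and after a call, so a hang shows up as an
    # "entering" line with no matching "returned" line for some rank.
    print(f"rank {rank}: entering {label}", file=sys.stderr, flush=True)
    result = fn(*args)
    print(f"rank {rank}: returned from {label}", file=sys.stderr, flush=True)
    return result

# Stand-in for the real collective; in the actual application this would be
# a call such as MPI_Bcast (C) or a bcast through a Python MPI binding.
def fake_bcast(startup_info):
    return startup_info

info = bracket(0, "global broadcast", fake_bcast, {"nodes": 3936})
```

On a hung run, any rank whose "entering" line appears without a matching "returned" line pinpoints the blocked call. Flushing matters: buffered output from a job that is later killed is often lost.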
This failure ended up being a “good news/bad news” problem. The good news was that the problem was reproducible. The bad news was that it only happened with large jobs. Ranger is a valuable resource (as are all large supercomputers) that is in high demand. It is difficult to get access to large portions of these production machines for testing and debugging purposes. The manufacturer generally does not have a large-scale system — representing millions of dollars of investment — sitting around, so how do you fix a problem when you cannot run tests and have precious few opportunities to try solutions?
Unfortunately, the answer is that it takes time. In many cases, the solution boils down to some person in a cubicle staring at endless lines of code, trying to figure out the race condition or some other obscure interaction with the hardware that caused the failure.
Several changes occurred on Ranger as we increased the run size of my application from 12,000 cores, through a series of incremental sizes, to 60,000 cores. Some modules in the software stack (the series of modules and drivers my application uses to effectively utilize the hardware) were replaced, along with system-wide file system changes and upgrades to critical hardware components. The modifications were clearly effective: as the performance graph (Figure 2) shows, Ranger delivered near-linear performance scaling from 32 to 60,000 cores. Please note that this graph reports results in terms of effective flops, a measure that intentionally includes all the time the processors spend communicating with each other (thereby reducing the apparent floating-point efficiency). Although the graph looks linear, there is a slight deviation at the beginning as the Ranger communications network must include more switches to connect all the processors for the specified run size. (Adding more switches increases communications latency, because packets of information must make more hops through additional switch layers as they move between machines.)
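The effective-flops metric described above can be made concrete with a small calculation. The 363 TF/s full-machine figure comes from the article; the 75 TF/s value for 12,000 cores is a hypothetical number used only to illustrate the arithmetic:

```python
def effective_rate(total_flops, wall_seconds):
    # Effective flop/s: useful floating-point operations divided by total
    # wall-clock time, which deliberately includes all communication time.
    return total_flops / wall_seconds

def scaling_efficiency(rate_small, cores_small, rate_large, cores_large):
    # Fraction of ideal (linear) speedup retained when scaling up the run.
    ideal = rate_small * (cores_large / cores_small)
    return rate_large / ideal

# Hypothetical run times and the 12,000-core rate are illustrative only:
r12k = effective_rate(7.5e16, 1000.0)   # 75 TF/s on ~12,000 cores (assumed)
r60k = effective_rate(3.63e17, 1000.0)  # 363 TF/s on ~60,000 cores (from article)
eff = scaling_efficiency(r12k, 12000, r60k, 60000)
```

An efficiency near 1.0 corresponds to the near-linear scaling the graph shows; latency added by extra switch hops is one reason it falls slightly short of 1.0.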
Of course, I believe all the work was justified, since it advanced the research I am pursuing with my collaborator Harold Trease, a computational physicist at PNNL: teaching computers to identify the objects they “see” in unstructured images from cameras operating in uncontrolled real-world environments. For more information on our research, I recommend the TACC feature article, “Identifying Opportunities,” at www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber.
If you are planning to attend the SC08 conference this year, please realize as you walk around and marvel at the science being performed on TOP500 machines that the heart of the technology still resides in the human component.
Rob Farber is a senior research scientist in the Molecular Science Computing Facility at the William R. Wiley Environmental Molecular Sciences Laboratory, a Department of Energy national scientific user facility located at Pacific Northwest National Laboratory in Richland, WA. He may be reached at editor@ScientificComputing.com.