Translating Great Power into Great Productivity
Army Research Laboratory achieves 90-percent utilization
Researchers associated with the Department of Defense (DoD) have relied on supercomputers for over 50 years. This tradition continues today at the Army Research Laboratory Major Shared Resource Center (ARL MSRC), which helps the DoD exploit supercomputing technology for military advantage throughout the battlespace. With a mandate to help DoD scientists and engineers develop, test and field weapons systems faster and more efficiently, the MSRC relies on some of the world’s newest, most scalable and most powerful supercomputers.
The DoD High Performance Computing Modernization Program (HPCMP) recently cemented its commitment to making the ARL MSRC one of the world’s leading centers of supercomputing power by procuring four supersystems that, combined with existing resources, will put the ARL Center at more than 100 TFLOPS. These systems include a 1024-node cluster with 4096 3.0 GHz Intel Woodcrest cores for computation, increasing the MSRC’s computational capability by more than 50 TFLOPS; an 842-node system with Intel Dempsey cores delivering 21.5 TFLOPS; a test and development system; and a 64-node scientific visualization system.
Two of these systems are expected to rank among the world’s 15 most powerful supercomputers, and the scientific visualization supercomputer has broken several performance records.
But great power does not automatically translate into great productivity. Like every organization in the world that relies on computing power, the MSRC is constantly challenged with the need to utilize these resources more effectively. This challenge is especially acute when you consider that the money spent on supercomputing is spent on behalf of the general public welfare — and the weapons and systems being analyzed are intended for the security and safety of the entire nation.
So the challenge to the MSRC, as well as to computing centers all over the world, is not “How powerful is your system?” but “Can you make the most out of it?” A leading provider of supercomputing, the MSRC has driven 90 percent utilization on even the most powerful supercomputers, which is not an easy task. Ensuring this kind of result takes a careful, deliberate approach and partnerships with vendors who have the expertise to assist in designing, building, tuning, benchmarking and validating the system.
Understanding the mission and defining the system
The first step in ensuring that supercomputing resources are highly utilized is deeply understanding the mission and defining systems to meet its specific challenges. The MSRC supports a broad range of research and analysis programs across DoD ranging from simulating battlefield weather and tactical communications networks to basic research in the physical sciences, engineering, mathematics and information science. Other programs supported include work centered on the role of technology and the modern warfighter, novel weapons and propulsion technologies, advanced robotics, power and energy sources and sensors, and survivability technology. Many of these complex programs require millions of hours of scalable, parallel computational power to analyze extremely large problem sets.
The ARL MSRC was the first DoD HPCMP center to explore the use of commodity clusters to meet DoD’s computational requirements. The testbed was developed in 2000, comprising a 16-node Compaq Alpha cluster and an eight-node IBM Intel system. The testbed was further expanded in 2003 to include a 256-processor Linux Networx system.
Their effectiveness proven, three new supercomputing clusters were procured in 2004 to meet the challenges posed by such research programs. These included the JVN, a system named after John von Neumann; ranked 13th in 2004, it now ranks 33rd on the Top500 list. The JVN is a 2048-processor Linux Networx supersystem featuring Intel Xeon processors. To supplement the hardware, the ARL carefully selected key software applications. The ARL MSRC relies on comprehensive software tools including grid generation programs, flow solver software, mathematical libraries and parallel programming tools. Debugging and performance-analysis tools such as TotalView and Vampir are available, as is postprocessing software for scientific visualization. Additional software installed on the system includes ANSYS, CTH, ICEPIC, MATLAB, Monocode, CFD++, Fluent, and Pro/ENGINEER.
Steps to ensuring high utilization
Defining and procuring the system is just the start. High utilization rates can only be achieved through a well-structured process and experience gained from the cluster testbed. The key to that process is developing close working relationships with vendors. With the John von Neumann system, the MSRC formed a strategic working group with Linux Networx, Intel and Metacomp Technologies, whose application, CFD++, is critical to the Center’s user applications such as aerodynamics of guided weapons and aerothermodynamics for hypersonic weapons technologies.
This group convened weekly to review everything from the system itself to industry innovation that might influence the definition and optimization of the system. Beginning before the first piece of hardware saw the light of day, every step was planned and reviewed exhaustively with the ultimate goal of 90 percent utilization always in mind.
As the working group met, it became clear that an essential component of the system development and optimization process would be the creation of a test and development system (TDS), which was built into the original procurement. Comprising 16 nodes, it was installed with a software stack identical to that of the JVN system and included much of the same hardware.
Working with their group partners, the Center’s software engineers integrated the TDS into the MSRC infrastructure and created application and runtime environments consistent with other ARL systems. The team worked to install, tune and validate the many third-party applications used by MSRC customers. At this point, the researchers and scientists that the Center serves were invited to port codes over to the TDS to ensure that their applications would build and run on the JVN. Some of these researchers have in excess of one million CPU hours of allotted time, which meant that getting them up and running on the TDS would pay big dividends later.
The next step was to create an initial batch scheduler configuration, tuned over time to keep the scheduling consistent with the workload. In tuning the system, the Center had to keep in mind both observed and desired workloads, which meant that they had to keep in close contact with the researchers and scientists who would ultimately use the system. Next, the MSRC consolidated maintenance activities in order to minimize service interruptions and worked with its partners to thoroughly test software updates on the TDS. In doing so, they could ensure that, if JVN had to be taken down for an upgrade, the downtime would be well spent.
As the team got closer to being able to put the JVN into production, focus was on making sure that queue runtime lengths were consistent with the hardware and software reliability of the system. In addition, the working group collaborated to aggressively pursue and correct any problems with system-wide resources, such as the parallel file system or batch scheduler. Jobs that run for many days and then fail because of system problems would waste all the effort put into ensuring high utilization.
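The link between queue runtime limits and system reliability can be illustrated with a simple model. The sketch below is not the MSRC’s method, which the article does not describe; it assumes failures arrive at a constant rate (an exponential model), and the function names and the 500-hour mean time between failures are hypothetical values chosen only for illustration.

```python
import math


def completion_probability(runtime_hours: float, mtbf_hours: float) -> float:
    """Chance that a job of the given length finishes without a system failure.

    Assumes a constant failure rate, i.e. P(no failure in t) = exp(-t / MTBF).
    """
    if runtime_hours < 0 or mtbf_hours <= 0:
        raise ValueError("runtime must be >= 0 and MTBF must be > 0")
    return math.exp(-runtime_hours / mtbf_hours)


def max_runtime_for(target_probability: float, mtbf_hours: float) -> float:
    """Longest queue runtime limit that still meets the target completion chance."""
    if not 0 < target_probability <= 1 or mtbf_hours <= 0:
        raise ValueError("target must be in (0, 1] and MTBF must be > 0")
    return -mtbf_hours * math.log(target_probability)


if __name__ == "__main__":
    mtbf = 500.0  # hypothetical system-wide MTBF in hours, not an ARL figure
    limit = max_runtime_for(0.90, mtbf)
    print(f"Runtime limit for a 90% completion chance: {limit:.1f} hours")
    print(f"Completion chance for a 7-day job: {completion_probability(168, mtbf):.2f}")
```

Under this assumed model, a less reliable system calls for shorter queue limits, and each reliability improvement lets the limits grow without raising the share of long jobs lost to failures.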
With these steps completed, JVN was ready to be tuned, optimized and put into production. In May 2005, it achieved a performance of 10.65 trillion operations per second (teraflops), making it the most powerful production supercomputer at the MSRC and delivering the sought-after 90-percent utilization rate. As time went on, the steps taken to optimize the system were repeated with major software changes, thereby ensuring a return to high utilization rates once the system returned to production.
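The article does not say how the Center measures utilization. One common definition, assumed here, is the share of available processor-hours that jobs actually consumed over a reporting period. The minimal sketch below uses that definition; the Job record, the function name and the sample jobs are hypothetical, and only the 2048-processor size comes from the description of JVN.

```python
from dataclasses import dataclass


@dataclass
class Job:
    processors: int  # processors allocated to the job
    hours: float     # wall-clock hours the job held them


def utilization(jobs, total_processors: int, period_hours: float) -> float:
    """Fraction of the available processor-hours consumed by the given jobs."""
    if total_processors <= 0 or period_hours <= 0:
        raise ValueError("total_processors and period_hours must be positive")
    used = sum(job.processors * job.hours for job in jobs)
    return used / (total_processors * period_hours)


if __name__ == "__main__":
    # Hypothetical one-day accounting log for a 2048-processor system.
    day = [Job(processors=1024, hours=24.0), Job(processors=1024, hours=19.2)]
    print(f"Utilization: {utilization(day, 2048, 24.0):.0%}")
```

By this measure, scheduled maintenance and time lost to failed jobs both count against the total, which is why consolidating maintenance windows and testing updates on the TDS bear directly on the figure.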
Putting John von Neumann to work
Since it went into production in May 2005, JVN has supported various applications including electromagnetics, penetration mechanics, aerodynamics and computational fluid dynamics. Among the numerous codes, JVN was originally tuned for use with electromagnetic codes such as ICEPIC and Monocode. A recent ICEPIC project, which has used more than one million processor hours in recent months, is supporting the design and development of high-power microwave sources for the disruption/destruction of electronics. Another project using Monocode studies the improvement of millimeter-wave radar signature prediction for ground vehicles.
JVN also has been used recently to assist the Air Force Seek Eagle Office at Eglin Air Force Base, FL, to perform engineering analyses for Air Force weapons certification efforts and to provide the warfighter with the maximum combat capability for all weapon systems. Using engineering analyses, flight test results, or a combination of both, the office determines acceptable flight limits and conditions in which the warfighter may safely carry and employ a weapon system. To accomplish this task, computational fluid dynamics (CFD) calculations are made for store loads, store separations, flutter, stability and control, vibrations, ballistic accuracy, and safe escape.
Understanding the mission and defining the system, again
The more success you have, the greater the problems you encounter. JVN helped solve many of the problems of the past, but that has merely revealed new, more intractable, data-intensive problems that require additional supercomputing resources to solve.
For instance, behind-armor debris is a major cause of damage in military vehicles that have been perforated by a penetrator, bullet or fragment. The ability to predict the debris field resulting from an attack is critical to assessing and improving the survivability of tactical systems. Modeling of the debris field has historically been done by statistically analyzing data from carefully controlled experiments, a lengthy and expensive process. Numerical simulation has not yet been successfully exploited to supplement these experiments, since previous computer systems could not handle such massive, sophisticated problems.
Challenges such as this are the driving force behind the latest DoD HPCMP procurements for the MSRC. As with JVN before, the systems are undergoing a rigorous, deliberate optimization process. The ARL has teamed with Linux Networx and other vendors to ensure that these systems arrive ready for production at power-up and deliver the highest possible utilization rates.
Mr. Nietubicz is Division Chief for the High Performance Computing Division, Computational & Information Sciences Directorate, Army Research Laboratory, Aberdeen Proving Ground, Maryland. Thomas Kendall is Chief Engineer of the U.S. Army Research Laboratory Major Shared Resource Center (ARL MSRC). They can be reached at editor@ScientificComputing.com.