Developing Fault-tolerant Software
A shift in design paradigms is needed to accommodate hardware failure
Tera-scale yesterday, peta-scale today, exa-scale next week: this seems to be the fervor of high performance computing. Ever larger numbers of hardware components are being connected by interconnects of growing port count and density, with ever more memory and storage attached. What lags behind this growth in computing power is software that can effectively use these new machines.
Many existing scientific codes were designed for machines consisting of tens to hundreds of processors. In the tera-scale range, we graduate to thousands of processors. We are approaching the physical limits of how fast we can clock a processor and how small we can make its circuits while still obtaining reliable results, so, to build a faster and more powerful compute engine, we connect more pieces together. As we increase the component count, we decrease the mean time to failure, not of any individual component, but of the collection: the longer a job runs, and the more hardware it runs on, the greater the probability that one or more of the components involved will fail in some catastrophic way.
Some effort has been devoted to mitigating measures, such as predicting which hardware is about to fail, making components hot swappable, and proactively rescheduling software away from suspect parts. While these steps will help in some situations, there will still be failures we don’t catch in components we don’t monitor.
Another solution is redundant hardware but, even at tera-scale computing levels, full redundancy is prohibitively expensive. There are also advocates of checkpoint restarting; however, for sufficiently large systems, the cost and logistics of checkpointing massive volumes of distributed memory become prohibitive.
Perhaps a more effective path to surmounting the difficulties presented by large-scale hardware environments is to rest the responsibility for fault-tolerant computing on the shoulders of the software design community rather than the hardware community. Progress already has been made in developing fault-tolerant software, which might be categorized into two realms: data-centric and process-centric. An excellent example of a data-centric approach will be presented at the International Conference on HPC this December.1
In this article, I explore a process-centric strategy, the shift in software design paradigms that I see as necessary for HPC codes to accommodate these failures, along with some steps that have been taken toward that objective and some potential future directions. I will further restrict my focus to the tightly coupled HPC environment that has been my domain of practice for the past several years,2 though some of these ideas may find broader application in the more loosely coupled environments of cloud computing. I have used these techniques to develop codes in the bioinformatics arena, one for tandem mass spectra analysis3 and the other for sequence analysis.4 Similar strategies have been examined for algorithms in linear algebra.5
Generating software that can effectively and reliably use tera-scale and beyond hardware requires us to discard some assumptions that have been fundamental to scientific HPC codes of the past. There are several, for example, that input/output operations never fail and are relatively inexpensive, and that communication calls always succeed, but time and space permit only one discussion here. I have, therefore, chosen one of the more difficult assumptions to discard: the notion that a consistent set of resources is available for the duration of a computation. Many simulation codes, whether they be molecular dynamics, quantum mechanics, climate modeling, subsurface geochemistry or a host of others, make this assumption, as evidenced by their use of global participation calls, such as MPI_Reduce, or of global barriers as synchronization points.
A global participation call is simply one in which every processor is required to participate. The difficulty caused by these calls is that, if any processor fails, the rest hang at the global participation call, waiting for participants that will never arrive because they have ceased to function. So, if we are not to use global participation functions, what should we do instead?
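To make the hazard concrete, here is a minimal sketch of the pattern in question; the iteration count and the stand-in local computation are illustrative assumptions on my part. Every rank must reach the MPI_Allreduce on every step, so the failure of a single processor leaves all the others blocked inside the collective, with no timeout and no way to proceed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double global = 0.0;
    for (int step = 0; step < 100; ++step) {
        /* Stand-in for the real local computation on this rank. */
        double local = (double)(rank + 1) * step;

        /* Global participation: every rank must arrive here. If any rank
         * has died, the remaining ranks wait indefinitely.               */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("final global value: %g (from %d ranks)\n", global, size);
    MPI_Finalize();
    return 0;
}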
It is necessary for any software to have information about its environment, a sort of self-awareness of what resources are available. So, a useful replacement for the assumption of a fixed set of resources is a somewhat weaker one: that the resources with which we start the job are known, and that they may disappear at any time, but only a few at a time. This is frequently the case with our current large InfiniBand-based cluster. Some nodes are always experiencing some hardware difficulty, such as a bad disk, memory chip or network connection, but only a few at a time. In the future, it would be great to relax this assumption even further to allow an application to acquire additional resources.
Some success has been seen recently in avoiding these global participation calls with the server-client strategy, in which tasks are dynamically handed out by a server to clients; the clients execute the tasks and send their results back to the server, which accumulates them. Two well-known examples are MapReduce6 and the BOINC paradigm.7
The client initially sends a “report for duty” message to the server and receives initialization information and a task to perform. When finished executing the task, the client exchanges its result for another task, which might be an all-finished task-quit. The server hands out the tasks to be executed and, to tolerate the failure of any client, tracks task completion. After handing out all of its tasks, it passes the still-incomplete ones to clients as they finish. The server cares not which client finishes a multiply assigned task first, just that it finishes; subsequent replicas of that task’s results are ignored or used for verification. When all tasks have been completed, the server issues a task-quit to those clients still engaged in redundant computation of the last tasks.
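A minimal sketch of this exchange in MPI follows. The message tags, the one-integer task descriptor and the fixed task count are illustrative assumptions, and a production server would also drop clients that stop reporting rather than wait for them at shutdown. The server treats any incoming message as a request for work, counts each task only the first time it is completed, and silently ignores replicas.

#include <mpi.h>
#include <stdio.h>

#define NTASKS     1000
#define TAG_RESULT 1   /* client -> server: id of a finished task (-1 = report for duty) */
#define TAG_TASK   2   /* server -> client: id of the next task to execute               */
#define TAG_QUIT   3   /* server -> client: all tasks are complete, shut down            */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                         /* server */
        char completed[NTASKS] = {0};
        int ncomplete = 0, next_fresh = 0, reissue = 0, quit_sent = 0;

        while (quit_sent < size - 1) {       /* until every client has been released */
            int finished;
            MPI_Status st;
            MPI_Recv(&finished, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            if (finished >= 0 && !completed[finished]) {
                completed[finished] = 1;     /* the first copy of a result counts ... */
                ++ncomplete;                 /* ... later replicas are simply ignored */
            }
            if (ncomplete == NTASKS) {
                int none = -1;
                MPI_Send(&none, 1, MPI_INT, st.MPI_SOURCE, TAG_QUIT, MPI_COMM_WORLD);
                ++quit_sent;
            } else if (next_fresh < NTASKS) {
                MPI_Send(&next_fresh, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
                ++next_fresh;
            } else {                         /* all tasks issued: reissue incomplete ones */
                while (completed[reissue]) reissue = (reissue + 1) % NTASKS;
                MPI_Send(&reissue, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
            }
        }
        printf("server: %d tasks complete\n", ncomplete);
    } else {                                 /* client */
        int task = -1;                       /* "report for duty" */
        MPI_Status st;
        for (;;) {
            MPI_Send(&task, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_QUIT) break;
            /* Execute the task here (omitted); its id is then traded for the next one. */
        }
    }

    MPI_Finalize();
    return 0;
}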
This strategy works fine for applications that are termed embarrassingly parallel, where all tasks may be executed concurrently, but many modeling codes impose an ordering on their tasks. Some groups of tasks may be executed concurrently, yet all must be finished before the input values for subsequent tasks are defined, and these dependencies lead to global synchronization/accumulation points in the code. The single-server, many-client model can be adapted to handle this simply by looping over sets of tasks (call them stages) and, at the end of a stage, sending a “finish stage” message to the clients.
One of the performance issues with this approach in a tightly coupled environment is that, because tasks toward the end of a stage’s task list may be handed out multiple times, the last task to finish is likely to be passed to every client. If the last tasks are very short in duration, the effect will be negligible, so one strategy to relieve this termination delay is to sort concurrently scheduled tasks by their expected run times, longest first, when those are easily computable. When the tasks are longer in duration and less predictable, it is useful to provide a checking mechanism for the clients to test at a regular interval whether the stage has been finished and they should stop their current task to prepare for the next stage. Making that interval length a tunable parameter allows the overhead of checking the stage state to be balanced against the amount of redundant work you are willing to tolerate at the end of a stage.
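The sketch below illustrates that client-side test for a single stage. The message tags, the chunked stand-in computation and the default check interval are illustrative assumptions of mine. Each client works in chunks and, every check_interval chunks, uses MPI_Iprobe to ask whether the server has already declared the stage finished; the first completed copy of the task closes the stage, and the remaining clients abandon their redundant copies.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_RESULT     1   /* client -> server: 1 = finished the task, 0 = abandoned it */
#define TAG_STAGE_DONE 2   /* server -> client: this stage has already been completed   */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }        /* needs a server and a client */

    /* Tunable: number of work chunks between checks of the stage state. */
    int check_interval = (argc > 1) ? atoi(argv[1]) : 16;
    if (check_interval <= 0) check_interval = 16;

    if (rank == 0) {                                   /* server */
        int flag, done = 1;
        MPI_Status st;
        MPI_Request *req = malloc((size - 1) * sizeof(MPI_Request));
        MPI_Recv(&flag, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);                 /* the first finished copy */
        for (int w = 1; w < size; ++w)                 /* declare the stage finished */
            MPI_Isend(&done, 1, MPI_INT, w, TAG_STAGE_DONE,
                      MPI_COMM_WORLD, &req[w - 1]);
        for (int i = 0; i < size - 2; ++i)             /* collect the remaining replies */
            MPI_Recv(&flag, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
        MPI_Waitall(size - 1, req, MPI_STATUSES_IGNORE);
        free(req);
    } else {                                           /* client */
        int finished = 1, stop = 0;
        double acc = 0.0;
        /* Work is skewed by rank so the redundant copies finish at different times. */
        for (int chunk = 0; chunk < 5000 * rank && !stop; ++chunk) {
            for (int i = 0; i < 10000; ++i)            /* stand-in for a long task */
                acc += 1.0e-9 * (rank + i);
            if (chunk % check_interval == 0) {         /* periodic stage-state check */
                MPI_Iprobe(0, TAG_STAGE_DONE, MPI_COMM_WORLD, &stop,
                           MPI_STATUS_IGNORE);
                if (stop) finished = 0;                /* abandon the redundant copy */
            }
        }
        MPI_Send(&finished, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
        MPI_Recv(&stop, 1, MPI_INT, 0, TAG_STAGE_DONE, /* consume the stage signal */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d %s its copy of the task (acc = %g)\n",
               rank, finished ? "finished" : "abandoned", acc);
    }

    MPI_Finalize();
    return 0;
}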
While the above server-client strategy avoids global participation, it still suffers from some maladies at very large scale. The primary difficulty is that single-server, many-client programs can overwhelm the communication fabric and its software stack, as we have seen on our current cluster. While this is somewhat dependent on the communication library implementation and the interconnect, it is to be expected on any sufficiently large HPC cluster. As in a corporate environment, there is a limit to how many workers a single manager can effectively communicate with and manage. A solution for HPC is similar to that of corporate management: hierarchically managed workers. A middle layer of servers provides communication throttle control for the primary server, so as not to overwhelm the communication infrastructure, allowing codes to scale to the full width of our computing resource.
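One way such a hierarchy might be laid out is sketched below; the group size of 64 is an illustrative assumption, and in this simple layout world rank 0 doubles as the leader of group 0. MPI_Comm_split carves the job into worker groups, each led by a sub-manager, and a second communicator connects the primary manager to the sub-managers so that only they ever exchange messages with it directly.

#include <mpi.h>
#include <stdio.h>

#define GROUP_SIZE 64          /* workers (plus their sub-manager) per group */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Group ranks into blocks of GROUP_SIZE; rank 0 within each block
     * becomes that block's sub-manager. In this sketch, world rank 0
     * is both the primary manager and the leader of group 0.          */
    int group_id = world_rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group_id, world_rank, &group_comm);

    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);

    /* A second communicator links the primary manager and the sub-managers,
     * throttling how many processes ever talk to world rank 0 directly.    */
    int is_submanager = (group_rank == 0);
    MPI_Comm manager_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_submanager ? 0 : MPI_UNDEFINED,
                   world_rank, &manager_comm);

    if (world_rank == 0)
        printf("primary manager over %d group(s)\n",
               (world_size + GROUP_SIZE - 1) / GROUP_SIZE);
    else if (is_submanager)
        printf("sub-manager for group %d\n", group_id);
    else
        printf("worker %d in group %d\n", group_rank, group_id);

    /* Task traffic would now flow primary -> sub-managers over manager_comm
     * and sub-manager -> workers over group_comm, using the same
     * request/assign exchange as the flat sketch earlier in the article.   */
    if (manager_comm != MPI_COMM_NULL) MPI_Comm_free(&manager_comm);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}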
No approach is without its drawbacks, and one needs to keep the proper balance so as not to have too many managers and not enough workers. Extra layers of communication also increase overhead and resource demand. The relevant question is: how much overhead is tolerable? Experience suggests that a fair amount of overhead may well be worth the avoidance of failed jobs.
A weak point of the hierarchically managed strategy is that, if a sub-manager process crashes, all of the processes that report to it are lost. This slows things down, yet it is not catastrophic and can be controlled by an appropriate choice of task group size. More serious is the loss of the primary server or manager process, which is indeed fatal to a job. However, the mean time between failures (MTBF) for a particular component is usually large enough that the probability of failure of the processor on which the manager process is running is fairly low. Nonetheless, one could envision giving the manager (and possibly the sub-managers) a backup heartbeat mirror to decrease the likelihood of catastrophic failure caused by any single component of the system.
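To sketch the detection half of that idea, the fragment below has the manager send a periodic heartbeat to a mirror rank, which declares the manager lost if no beat arrives within a timeout. The tag, interval and timeout values are illustrative assumptions, and it should be noted that a standard MPI runtime will normally abort the entire job when a rank actually dies, so truly taking over for a lost manager additionally requires fault-tolerance support from the runtime.

#include <mpi.h>
#include <stdio.h>

#define TAG_HEARTBEAT 9
#define NBEATS        10     /* beats the manager sends in this demo         */
#define BEAT_SEC      0.5    /* interval between heartbeats                  */
#define TIMEOUT_SEC   5.0    /* mirror declares the manager lost after this  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }        /* needs manager + mirror */

    if (rank == 0) {                                   /* manager */
        for (int beat = 1; beat <= NBEATS; ++beat) {
            double t0 = MPI_Wtime();
            while (MPI_Wtime() - t0 < BEAT_SEC)
                ;                                      /* stand-in for manager work */
            int alive = (beat < NBEATS);               /* 0 marks a clean shutdown  */
            MPI_Send(&alive, 1, MPI_INT, 1, TAG_HEARTBEAT, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {                            /* mirror */
        int alive = 1, flag;
        MPI_Request req;
        double last = MPI_Wtime();
        MPI_Irecv(&alive, 1, MPI_INT, 0, TAG_HEARTBEAT, MPI_COMM_WORLD, &req);
        for (;;) {
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                if (!alive) break;                     /* manager shut down cleanly  */
                last = MPI_Wtime();                    /* beat received: reset clock */
                MPI_Irecv(&alive, 1, MPI_INT, 0, TAG_HEARTBEAT, MPI_COMM_WORLD, &req);
            } else if (MPI_Wtime() - last > TIMEOUT_SEC) {
                printf("mirror: no heartbeat in %.0f s, manager presumed lost\n",
                       TIMEOUT_SEC);
                /* A real mirror would take over the manager's task table here. */
                MPI_Cancel(&req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                break;
            }
        }
    }

    MPI_Finalize();
    return 0;
}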
In summary, then, as we move into the peta- and exa-scale arena for HPC, we in the software community need to think outside of the fixed-resources-with-global-synchronizations box and design locally synchronized, dynamically scheduled and hierarchically managed applications that can complete their computations despite the expected modest number of hardware component failures.
Doug Baxter is a consultant and research scientist for the Molecular Science Computing visualization and consulting group, WR Wiley Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory. He may be reached at [email protected].
REFERENCES
1. Abhinav Vishnu, Huub Van Dam, Wibe De Jong, Pavan Balaji, and Shuaiwen Song. “Fault-Tolerant Communication Runtime Support for Data-Centric Programming Models,” to be presented at HiPC 2010, Goa, India, Dec. 19-22, 2010.
2. William R. Wiley Environmental Molecular Sciences Laboratory, Molecular Science Computing national user facility: www.emsl.pnl.gov/capabilities/computing/msc/supercomputing/supercomputer.jsp
3. Cannon WR, KH Jarman, BM Webb-Robertson, DJ Baxter, CS Oehmen, KD Jarman, A Heredia-Langner, GA Anderson, and KJ Auberry. “A Comparison of Probability and Likelihood Models for Peptide Identification from Tandem Mass Spectrometry Data,” J. Proteome Res., 2005, 4(5):1687-1698.
4. Oehmen, C.S., J. Nieplocha. “ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis,” IEEE Trans. Parallel Dist. Sys., (2006), 17(8), 740-749.
5. Parry Husbands and Katherine Yelick. “Multi-Threading and One-Sided Communication in Parallel LU Factorization,” SC’07, Nov. 10-16, 2007, Reno, Nevada, USA.
6. Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters,” OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, Dec. 2004.
7. David P. Anderson. “BOINC: A System for Public-Resource Computing and Storage,” 5th IEEE/ACM International Workshop on Grid Computing, Nov. 8, 2004, Pittsburgh, PA.