Scientific Process Automation Improves Data Interaction
Workflow infrastructure automates time-intensive manual processes
FIGURE 1: Storage growth over time
As computers become more powerful, they are better able to provide answers to increasingly complex scientific questions. Unfortunately, answering these questions comes with significant costs: scientists must become intimately familiar with hardware platforms, learn specialized languages and data formats to efficiently represent results, and actively shepherd their simulation runs to identify and correct potential problems as early as possible. In practice, this means computational scientists spend significantly more time managing their data and babysitting their runs than doing science.
The situation worsens as the computational power of the largest high performance computing (HPC) platforms increases. While scientific simulations continue to approach full-scale, full-physics models, hardware becomes increasingly complex (i.e., more cores, more processors, more interconnects), and the data produced becomes enormous. Significant resources must be available to store and analyze the results of each simulation run. For example, Figure 1 illustrates the increased storage requirements at Oak Ridge National Laboratory (ORNL).
FIGURE 2: “Napkin View” of TSI workflow Courtesy of D. Swesty, Stony Brook
The U.S. Department of Energy (DOE) Scientific Discovery through Advanced Computing (SciDAC) program [1] has assembled several multi-institutional “Centers for Enabling Technology” that are charged with developing solutions to the challenges faced when bringing peta-scale computing to computational science. These centers focus on a variety of computer science and applied mathematics disciplines such as visualization, performance modeling, linear solvers and optimal mesh generation. The SciDAC Scientific Data Management (SDM) Center is focused on improving scientists’ ability to interact with their data in three key areas:
• reading and writing data through storage-efficient access
• analysis of large data sets to find features of interest using data mining and analysis
• automation of the overall simulation and analysis process through scientific process automation (SPA)
SciDAC’s SPA team is focused on developing and deploying automated workflows for a variety of computational science domains. Scientific workflows are the formalization of a scientific process that is frequently and repetitively performed. Figure 2 shows a high-level description of a typical scientific process. The SPA team faces the challenge of converting these manual, time-intensive processes into a workflow that can be executed with minimal or no supervision. The team tackles the challenge through the use of a workflow engine such as Kepler (www.kepler-project.org) [2], in which an executable version of the process is defined, configured and run. Kepler acts as the orchestrator of the workflow, coordinating both the data transfers to and execution of individual components (called actors) that perform specific tasks.
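The actor idea can be sketched in a few lines of plain Python. This is a minimal illustration of the concept only, not Kepler's actual API (Kepler is Java-based, and its actors and directors are defined quite differently); all class and function names here are hypothetical.

```python
# Minimal sketch of the actor concept: each actor performs one task and
# passes its outputs downstream, while an engine orchestrates the chain.
# All names are illustrative, not Kepler's real interfaces.

class Actor:
    """One step in a workflow: consumes inputs, produces outputs."""
    def __init__(self, name, task):
        self.name = name
        self.task = task  # a callable performing the actual work

    def fire(self, data):
        return self.task(data)

class Workflow:
    """A trivial sequential orchestrator standing in for the engine."""
    def __init__(self, actors):
        self.actors = actors

    def run(self, data):
        for actor in self.actors:
            data = actor.fire(data)
        return data

# Example: a two-actor pipeline that configures input, then "submits" it.
configure = Actor("configure", lambda d: {**d, "configured": True})
submit = Actor("submit", lambda d: {**d, "job_id": 42})
result = Workflow([configure, submit]).run({"input": "params.txt"})
```

In the real engine, actors also carry ports and parameters, and a director decides the execution semantics; the sketch keeps only the orchestration idea.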
FIGURE 3: Executable TSI workflow
Workflow technology has been available in the business environment for many years to automate business processes, such as approving timecards and handling invoices. However, scientific workflows are more complex in a number of ways: they are data-intensive (for instance, manipulating terascale data sets), and they often require executing tasks in parallel in a distributed, heterogeneous environment [3]. Fortunately, most scientific simulation runs follow a similar pattern (Figure 2):
• configure simulation input
• transfer the code and the input to an HPC machine
• submit the job request
• wait for the job to start
• monitor the job progress (queued, running, completed)
• transfer output files to an analysis machine
• run the analyses (usually an iterative process)
• archive results
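The steps above can be sketched as a linear driver script. Every helper here is a stub standing in for a site-specific tool (a batch scheduler, a file-transfer client, an archive system); the names are placeholders, not real commands.

```python
# Sketch of the common simulation pattern as a linear driver.
# Each stage is a stub standing in for a real tool (scheduler CLI,
# file-transfer client, archive system); names are illustrative only.

steps_run = []

def stage(step):
    """Record a step; a real driver would invoke the actual tool here."""
    steps_run.append(step)

def run_simulation_campaign(input_file):
    stage("configure input: " + input_file)
    stage("transfer code and input to HPC machine")
    stage("submit job request")
    stage("monitor job: queued -> running -> completed")
    stage("transfer output files to analysis machine")
    stage("run analyses")
    stage("archive results")
    return steps_run

run_simulation_campaign("params.nml")
```

A workflow engine replaces this rigid script with a graph of actors, which is what makes the variations listed below (concurrent transfers, in-flight analysis) straightforward to express.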
Of course, the details of each step and the order of the steps can vary greatly. For example, in some cases
• Files must be transferred off the HPC machine while the simulation is still running.
• The number and size of machines involved may vary dramatically.
• The data analysis may begin while the simulation is running.
These common requirements, however, form the foundation and templates on which to build a scientific workflow infrastructure.
The SPA approach
Several workflow environments are currently available and in use in both the scientific and business communities. SPA uses the Kepler workflow engine for a number of reasons:
FIGURE 4: Dashboard interface Courtesy of C.S. Chang, NYU
• It is an open source workflow environment built on the Ptolemy II engine, which has been available for more than a decade. Having a distributable and extensible environment is a key requirement for SciDAC.
• There is a strong developer community focused on enhancing and extending the Kepler infrastructure. This community, consisting primarily of researchers conducting projects funded by DOE and NSF, includes significant support outside of UC Berkeley (where Ptolemy was initially created). This activity dramatically increases the longevity of the technology.
• Kepler provides multiple mechanisms for control flow and data flow within the workflow engine, making it easy to perform tasks in parallel or to force sequential execution of specific steps. This flexibility is important in scientific environments.
• It allows nested workflows, which enables workflow designers to provide appropriate levels of abstraction.
• Kepler is written in Java and is, thus, portable across most computational platforms.
• It has a GUI that allows workflow developers to graphically define workflows. This increases the accessibility and potential user base of the tool.
In addition to developing and deploying workflows for specific scientific tasks (e.g., the workflow in Figure 3 shows the executable workflow corresponding to the process described in Figure 2), the SPA team has focused on extending the Kepler workflow environment in four areas:
• data provenance
• generic actors
• fault tolerance
• a dashboard environment to monitor the progress of the workflow and simulations
Turbulence simulation in toroidal geometry Courtesy of C.S. Chang, NYU
The team has extended the original Kepler environment to include a data provenance recorder and has built a flexible provenance data management environment. The recorder captures a user-defined set of actions as they occur during workflow execution (e.g., actor execution, data transfers) and stores them in a database. Additional provenance information from external processes (e.g., the simulation code) is also recorded. This addition enables post-workflow analysis to be performed directly against the database. The amount of information collected during a workflow execution can be daunting, with hundreds of thousands of events recorded. To address this complexity, the workflow developer can easily define which events
• should be permanently recorded
• are only needed while the workflow is running (i.e., for monitoring and fault tolerance)
• can be completely ignored
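A recorder with these three per-event policies might look like the following sketch. The event names and the policy mechanism are hypothetical; the actual SPA recorder stores events in a database rather than in lists.

```python
# Sketch of a provenance recorder with per-event-type policies,
# mirroring the three categories above. Event names and the policy
# mechanism are illustrative, not the SPA recorder's real design.

PERMANENT, TRANSIENT, IGNORE = "permanent", "transient", "ignore"

class ProvenanceRecorder:
    def __init__(self, policies, default=PERMANENT):
        self.policies = policies      # maps event type -> policy
        self.default = default
        self.permanent = []           # would live in a database
        self.transient = []           # kept only while the workflow runs

    def record(self, event_type, detail):
        policy = self.policies.get(event_type, self.default)
        if policy == PERMANENT:
            self.permanent.append((event_type, detail))
        elif policy == TRANSIENT:
            self.transient.append((event_type, detail))
        # IGNORE events are dropped entirely

    def workflow_finished(self):
        self.transient.clear()        # monitoring data no longer needed

rec = ProvenanceRecorder({
    "actor_execution": PERMANENT,     # keep for post-workflow analysis
    "heartbeat": TRANSIENT,           # only for monitoring / fault tolerance
    "debug": IGNORE,                  # never stored
})
rec.record("actor_execution", "file transfer completed")
rec.record("heartbeat", "job still running")
rec.record("debug", "buffer size = 4096")
rec.workflow_finished()
```

Filtering at record time, rather than pruning a full log afterward, is what keeps the permanent store manageable when a run generates hundreds of thousands of events.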
Based on initial feedback from science partners, the team continues to improve the efficiency and usability of this provenance infrastructure.
Most simulation workflows with which the team is concerned share a similar pattern and thus require conceptually similar actors. Unfortunately, as workflows are defined, minor differences typically appear. These differences can make it difficult to effectively share components among workflows, or to easily move a workflow between two HPC environments. For example, different machines may support different file transfer protocols (e.g., FTP, scp, gridFTP), user credentialing mechanisms (e.g., ssh, certificates) and job submission processes. To increase the usability and portability of workflows, the SPA team is developing workflow pattern templates and generic actors. Templates allow quick prototyping by reducing new workflows to known solutions, while generic actors complement them by providing abstract functionality, such as file transfer, across multiple underlying implementations. Furthermore, these actors contain commonly required functionality, such as error handling, which would otherwise need to be explicitly defined in the workflow. By adapting to the execution context, these actors dramatically improve the portability of the workflows.
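A generic file-transfer actor can be sketched as one abstract operation dispatching to protocol-specific implementations, with shared retry logic. The protocol names match those in the text; the dispatch mechanism, class names, and command strings are illustrative only.

```python
# Sketch of a "generic actor" for file transfer: one abstract operation,
# several underlying protocols, and shared error handling. The dispatch
# scheme and command strings are illustrative, not SPA's real actors.

class TransferError(Exception):
    pass

# Each backend would invoke the real tool; here they return the command.
def _scp(src, dst):      return "scp {} {}".format(src, dst)
def _gridftp(src, dst):  return "globus-url-copy {} {}".format(src, dst)
def _ftp(src, dst):      return "ftp put {} {}".format(src, dst)

PROTOCOLS = {"scp": _scp, "gridftp": _gridftp, "ftp": _ftp}

class GenericFileTransferActor:
    """Picks a concrete transfer mechanism from the execution context."""
    def __init__(self, context):
        proto = context.get("transfer_protocol", "scp")
        if proto not in PROTOCOLS:
            raise TransferError("unsupported protocol: " + proto)
        self.transfer = PROTOCOLS[proto]

    def fire(self, src, dst, retries=3):
        # Common error handling lives here, not in every workflow.
        for attempt in range(retries):
            try:
                return self.transfer(src, dst)
            except OSError:
                continue
        raise TransferError("giving up after {} attempts".format(retries))

actor = GenericFileTransferActor({"transfer_protocol": "gridftp"})
cmd = actor.fire("run01/out.h5", "analysis:/data/out.h5")
```

The workflow definition mentions only the abstract transfer step; swapping machines means changing the context, not the workflow.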
As HPC computational platforms become increasingly complicated, the importance of having a fault-tolerant environment also increases. Traditional process management tools, such as scripts, require the developer to explicitly handle all failures. Workflow environments, however, can intrinsically provide at least minimal fault-tolerance capabilities. The SPA team has prototyped different fault-tolerance mechanisms, e.g., using provenance information [4], a check-pointing approach [5], and service duplication [3]. In each case, the idea is to keep track of which actor invocations were successful and to avoid costly recomputation of these steps when resuming the workflow after a failure. These ideas are similar to the rescue workflows found in DAGMan [6]. The team is working on further extending the Kepler-based SPA environment to allow the workflow to effectively recover from a broader category of interruptions.
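The provenance-based recovery idea reduces to a simple rule: on resume, skip any actor invocation the provenance log already records as successful. A minimal sketch, with all names hypothetical:

```python
# Sketch of provenance-based recovery: actor invocations recorded as
# successful are skipped on resume rather than recomputed. Names are
# illustrative; the real mechanism queries the provenance database.

def run_with_recovery(steps, completed, log):
    """steps: ordered (name, fn) pairs; completed: names already done."""
    for name, fn in steps:
        if name in completed:
            log.append("skip " + name + " (recorded as successful)")
            continue
        fn()
        completed.add(name)   # record success in the "provenance" store
        log.append("ran " + name)
    return log

executed = []
steps = [
    ("transfer", lambda: executed.append("transfer")),
    ("simulate", lambda: executed.append("simulate")),
    ("archive",  lambda: executed.append("archive")),
]
# Pretend a prior run completed "transfer" before failing.
log = run_with_recovery(steps, completed={"transfer"}, log=[])
```

The savings scale with the cost of the skipped steps: re-running a completed terascale transfer or simulation stage is exactly the recomputation this avoids.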
A key challenge facing scientists who have long-running, computationally intensive simulations is real-time monitoring of their jobs to ensure the code is performing as expected. The SPA team uses workflows to populate a portal-like dashboard with images and plots in near-real-time. In addition to allowing the scientists to view simulation output as it is being processed, this interface (shown in Figure 4) allows them to perform comparisons between variables and simulations over time. The team is working on extending the supported analysis features and allowing modification of the underlying analysis workflow base. Eventually, it is expected that scientists will use this interface to define and initiate complex workflows using templates and wizards.
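The monitoring pattern itself is simple: a workflow-side loop watches for new simulation output and publishes derived plots to the dashboard store. In this sketch the "dashboard" is a plain dictionary and the plot renderer a stub; the real system publishes images to a web portal.

```python
# Sketch of the dashboard-population idea: only new output files are
# processed, and their derived plots are published to a store the
# dashboard reads from. The dict and stub renderer are illustrative.

def publish_new_outputs(outputs, dashboard, make_plot):
    """Render and publish plots for output files not yet on the dashboard."""
    for name in outputs:
        if name not in dashboard:              # only process new files
            dashboard[name] = make_plot(name)  # e.g., render an image
    return dashboard

dashboard = {}
render = lambda name: name + ".png"            # stub for a real plotter
# First poll sees one output file; a later poll sees a second one.
publish_new_outputs(["step0001.h5"], dashboard, render)
publish_new_outputs(["step0001.h5", "step0002.h5"], dashboard, render)
```

Repeating this check on each polling cycle is what gives the scientist a near-real-time view without reprocessing earlier time steps.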
Over the past eight years, the SPA team has developed a useful and usable scientific workflow infrastructure capable of automatically performing many of the repetitive tasks currently being done manually. The technology has been successfully applied in a variety of computational science domains including fusion, astrophysics, biology, climate modeling, groundwater and combustion. Wherever it has been applied, adoption of the technology has enabled scientists to focus on their science instead of the complexities of the underlying data management.
Enhancing Scientific Flexibility
Todd Halter and Jim Mather of the Atmospheric Radiation Measurement (ARM) program have been evaluating the Kepler infrastructure to determine whether it can provide the flexibility required by their scientists. The ARM program “Value Added Products” (VAPs) use leading-edge physics codes to derive additional information from instrument data streams. Due to the complexity of these products, they often take years to develop and require multiple iterations. The ARM infrastructure team has been asked to support scientists who would like to replace specific steps in the analysis process with their own, customized physics codes instead of the standard ARM codes. The SPA team worked with the ARM infrastructure team to develop a prototype workflow for one of the more complex VAPs and demonstrated the ability to replace physics codes through the workflow interface. The result was presented at the ARM Science Team meeting in April 2009.
An Early Adopter: Kepler Supports Fusion Center
The Center for Plasma Edge Simulation (CPES), led by C.S. Chang (NYU) (www.scidac.gov/FES/FES_CPES.html) has been an early adopter of several of the SDM Center technologies, including the Kepler environment. The CPES project is developing an integrated, predictive plasma-edge simulation that heavily utilizes the leadership-class computing facility at ORNL, producing over 50 terabytes of data in 24 hours. The CPES End-to-end Framework for Fusion Integrated Simulation (EFFIS) has been successfully performing automated coupled simulation using highly advanced inter-operating computer science tools that include workflow management, fast presentation of image data on a Web-based dashboard, increased speed of I/O operations, collection and management of provenance metadata, and wide-area data transfer.
EFFIS facilitates efficient integration of different fusion codes running on multiple and non-local platforms, provides users with real-time dynamic tracking of simulation progress through graphs and images on a dashboard, and has already been used on tens of thousands of processors in secure environments. The collaboration between the CPES and SDM centers has been extremely beneficial for both teams and highlights the importance of the SciDAC model where computer and application scientists work closely together to advance state-of-the-art computational science.
1. Scientific Discovery through Advanced Computing (SciDAC), www.scidac.gov, accessed June 2009.
2. Ludäscher, B, I Altintas, C Berkley, D Higgins, E Jaeger-Frank, M Jones, E Lee, J Tao, and Y Zhao. 2006. “Scientific Workflow Management and the Kepler System.” Concurrency and Computation: Practice & Experience 18(10), pp. 1039-1065.
3. Vouk, M, I Altintas, R Barreto, J Blondin, Z Cheng, T Critchlow, A Khan, S Klasky, J Ligon, B Ludaescher, P Mouallem, S Parker, N Podhorszki, A Shoshani, and C Silva. 2006. “Automation of Network-Based Scientific Workflows,” Proc. of the IFIP WoCo 9 on Grid-based Problem Solving Environments: Implications for Development and Deployment of Numerical Software, IFIP WG 2.5 on Numerical Software, Prescott, AZ. Printed in IFIP, Grid-Based Problem Solving Environments, eds. Gaffney PW and Pool JCT (Boston: Springer), Vol. 239, pp. 35-61, 2007.
4. Crawl, D and I Altintas. 2008. “A Provenance-Based Fault Tolerance Mechanism for Scientific Workflows.” International Provenance and Annotation Workshop (IPAW) 152-159.
5. Podhorszki, N, B Ludäscher, and S Klasky. 2007. “Workflow automation for processing plasma fusion simulation data.” In Workshop on Workflows in Support of Large-Scale Science (WORKS), 35-44, ACM, New York, NY, USA.
6. Directed Acyclic Graph Manager, www.cs.wisc.edu/condor/dagman, accessed June 2009.
Acknowledgements and key contributors
As part of the SciDAC SDM Center, this work is made possible with the continued support of the DOE SciDAC program. We also have benefited from being a founding member of the Kepler Consortium and from having fantastic access to our application scientist collaborators. The SDM Center is a multi-institutional organization that brings together leading researchers from a number of universities and national laboratories. Key contributors to the SPA thrust area are:
• Arie Shoshani (LBNL): SDM Center Lead PI
• Terence Critchlow (PNNL): SPA Thrust Lead
• Ilkay Altintas (San Diego Supercomputer Center)
• Scott Klasky (ORNL)
• Bertram Ludaescher (UC Davis)
• Norbert Podhorszki (ORNL)
• Claudio Silva (University of Utah)
• Mladen Vouk (North Carolina State University)
Terence Critchlow is the SPA Thrust Area Lead within the SciDAC SDM Center at Pacific Northwest National Laboratory. He may be reached at editor@ScientificComputing.com.
ARM Atmospheric Radiation Measurement | CPES Center for Plasma Edge Simulation | DOE U.S. Department of Energy | EFFIS End-to-end Framework for Fusion Integrated Simulation | GUI Graphical User Interface | HPC High Performance Computing | LBNL Lawrence Berkeley National Laboratory | NSF U.S. National Science Foundation | NYU New York University | ORNL Oak Ridge National Laboratory | PNNL Pacific Northwest National Laboratory | SciDAC Scientific Discovery through Advanced Computing | SDM Scientific Data Management | SPA Scientific Process Automation | UC University of California | VAP Value Added Product