It Takes Glue to Tango
MeDICi integration framework creates data-intensive computing pipeline
Figure 1: MeDICi workflow, provenance and integration. The loosely coupled components allow mix-and-match solutions.
Biologists increasingly rely on high-performance computing platforms to process the tsunami of data generated by high-throughput genome and metagenome sequencing technology and high-throughput proteomics. Unfortunately, the platforms that produce the massive data sets rarely work smoothly with the interactive analysis and visualization programs used in bioinformatics. This makes it difficult for researchers to exploit the computational power of HPC platforms to speed scientific discovery.
At the Department of Energy’s (DOE) Pacific Northwest National Laboratory (PNNL) in Richland, WA, researchers are creating computing environments for biologists that seamlessly integrate collections of data and computational resources, enabling users to rapidly analyze high-throughput data. A major goal is to shield the biologist from the complexity of interacting with multiple dissimilar databases and running tasks on HPC platforms and computational clusters. One of these environments, the MeDICi Integration Framework, is now available for free download. Short for Middleware for Data-Intensive Computing, MeDICi makes it easy to integrate separate codes into complex applications that operate as a data analysis pipeline.
What is MeDICi?
Figure 2: An example MeDICi Integration Framework pipeline, illustrating the incorporation of external HPC resources and databases.
MeDICi is an evolving middleware platform (computer software that connects software components or applications) for building complex, high-performance analytical applications. These applications typically comprise a pipeline of software components. Each component analyzes incoming data and passes its results to the next step in the pipeline. The platform creates analysis pipelines that unite the tools and data needed to support biologists in their analyses. MeDICi provides a set of mechanisms for easily plugging together unrelated, distributed codes and executing the resulting pipeline.
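The pipeline idea described above can be sketched in a few lines of plain Java. This is only an illustration of the dataflow pattern, not MeDICi code: the class name PipelineSketch, the pipeline method and the three toy stages are all hypothetical, and each stage is modeled as a simple function whose output feeds the next stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; these names are not part of the MeDICi API.
public class PipelineSketch {

    /** Chains stages so that each stage's output becomes the next stage's input. */
    static <T> Function<T, T> pipeline(List<Function<T, T>> stages) {
        Function<T, T> chain = Function.identity();
        for (Function<T, T> stage : stages) {
            chain = chain.andThen(stage);
        }
        return chain;
    }

    public static void main(String[] args) {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(String::trim);              // stage 1: clean the raw record
        stages.add(String::toUpperCase);       // stage 2: normalize case
        stages.add(s -> s.replace("T", "U"));  // stage 3: transcribe DNA to RNA

        // prints GAUUACA
        System.out.println(pipeline(stages).apply("  gattaca "));
    }
}
```

Because every stage shares one narrow contract (data in, data out), stages can be reordered or replaced without touching their neighbors, which is the property that lets unrelated codes be plugged together.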
The MeDICi technology starts with widely used, standards-based Java and Web services middleware and layers a simple dataflow-based programming model on top of these tools. This approach has enabled a small PNNL team to build a robust platform in just two years. MeDICi consists of three subsystems:
• MeDICi Integration Framework (MIF) — a Java-based, asynchronous messaging platform for application integration
• MeDICi Provenance — a Java API, RDF-based store and content management system for capturing and querying important metadata that can be used for debugging and reconstruction of application results
• MeDICi Pipeline — a BPEL-based environment that integrates with MIF to provide definition tools and a standards-based recoverable pipeline execution engine
The three subsystems function alone or together in any MeDICi application, depending on the needs of the scientists. The MIF is the heart of MeDICi and provides the basic programming interfaces for creating pipelines. MIF leverages the open source Mule Enterprise Service Bus. On top of Mule, MIF imposes a component-based programming model for pipeline creation.
MIF components are constructed using Java programming interfaces that support inter-component communication using asynchronous messaging. Local components execute inside the MIF container. Remote components create distributed solutions and integrate with non-Java code. They support the same programmatic interfaces and use additional MIF facilities to execute component code outside the MIF container.
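To make the idea of asynchronous messaging between components concrete, here is a minimal sketch using only the standard Java concurrency library. It does not use the MIF programming interfaces: the class AsyncMessagingSketch, the run method and the end-of-stream sentinel are assumptions made for illustration. The upstream component places messages on a queue and moves on, while the downstream component processes them on its own thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; these names are not part of the MIF API.
public class AsyncMessagingSketch {

    // Sentinel marking end of stream; real inputs must not equal this value.
    private static final String END = "__END__";

    /** Runs a two-component pipeline and returns what the downstream component produced. */
    static List<String> run(List<String> inputs) {
        BlockingQueue<String> link = new LinkedBlockingQueue<>();
        List<String> results = new ArrayList<>();

        // Downstream component: consumes messages on its own thread.
        Thread consumer = new Thread(() -> {
            try {
                for (String msg = link.take(); !msg.equals(END); msg = link.take()) {
                    results.add(msg.length() + ":" + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            // Upstream component: enqueues each message without waiting for it to be processed.
            for (String input : inputs) {
                link.put(input);
            }
            link.put(END);
            consumer.join(); // wait only so the caller can read the results
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running pipeline", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [3:MKV, 6:MKVLAA]
        System.out.println(run(Arrays.asList("MKV", "MKVLAA")));
    }
}
```

The sender never calls the receiver directly; it only knows about the queue. That decoupling is what allows a component to run in another thread, another process or on another machine without changing the sending code.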
The default MIF container mechanism uses Java method calls for exchanging messages between components. Using the Java Message Service (JMS) or other protocols to exchange messages can enhance scalability; doing so only requires configuring the communication links between components to use the chosen protocol.
Mule provides the MIF container environment. MIF extends the Mule interface to make component and pipeline construction easier and to create an encapsulation device for component creation. The MIF interface is agnostic of the underlying Java messaging platform. This allows deployments to configure MIF applications using technologies that meet individual quality-of-service requirements.
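One way to picture this transport independence is a component that depends only on an abstract link, with the concrete delivery mechanism chosen at configuration time. The sketch below is an assumption-laden illustration, not MIF, Mule or JMS code: Link, DirectLink, QueuedLink and emitReadings are hypothetical names, and QueuedLink is merely a buffered stand-in for a real message-queue protocol.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical illustration only; these types are not part of MIF, Mule, or JMS.
public class TransportSketch {

    /** A communication link between two components. */
    interface Link {
        void send(String message);
    }

    /** Default-style link: delivers via a plain Java method call. */
    static final class DirectLink implements Link {
        private final Consumer<String> receiver;
        DirectLink(Consumer<String> receiver) { this.receiver = receiver; }
        @Override public void send(String message) { receiver.accept(message); }
    }

    /** Stand-in for a message-queue protocol: buffers messages until flushed. */
    static final class QueuedLink implements Link {
        private final Queue<String> buffer = new ArrayDeque<>();
        private final Consumer<String> receiver;
        QueuedLink(Consumer<String> receiver) { this.receiver = receiver; }
        @Override public void send(String message) { buffer.add(message); }
        void flush() {
            while (!buffer.isEmpty()) {
                receiver.accept(buffer.poll());
            }
        }
    }

    /** Component code depends only on Link, so the transport can be swapped. */
    static void emitReadings(Link out) {
        out.send("spectrum-1");
        out.send("spectrum-2");
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();

        emitReadings(new DirectLink(received::add));
        System.out.println("direct: " + received);

        received.clear();
        QueuedLink queued = new QueuedLink(received::add);
        emitReadings(queued);
        System.out.println("queued, before flush: " + received);
        queued.flush();
        System.out.println("queued, after flush: " + received);
    }
}
```

The emitReadings method is identical in both runs; only the object passed in changes. In a real deployment that choice would typically live in configuration rather than code, so that quality-of-service decisions stay separate from component logic.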
Two current PNNL projects that illustrate the use of MIF in building pipelines to solve complex, data-intensive biology problems are featured in the companion articles “Community Proteomics Analysis” and “Aiding Environmental Cleanup.”
The future for MeDICi architecture
The MeDICi Integration Framework is freely downloadable and will soon be open source. By concentrating on simplicity of design and programming, and by building on standards-based technology for high performance and robustness, the developers expect MeDICi to become the glue that makes it possible for many diverse codes and tools to “dance” together.
Ian Gorton is the chief architect for PNNL’s Data Intensive Computing program. Christopher S. Oehmen and Jason E. McDermott are senior research scientists in the laboratory’s Bioinformatics and Computational Biology group, Fundamental and Computational Sciences Directorate. They may be reached at editor@ScientificComputing.com.
API: Application Programming Interface
BPEL: Business Process Execution Language
DIC: Data Intensive Computing
DOE: U.S. Department of Energy
EDMS: Experimental Data Management System
EMSL: Environmental Molecular Sciences Laboratory
MeDICi: Middleware for Data-Intensive Computing
MIF: MeDICi Integration Framework
PNNL: Pacific Northwest National Laboratory
PRISM: Proteomics Research Information Storage and Management System
RDF: Resource Description Framework
SCOP: Structural Classification of Proteins