Make No Little Plans
Enabling an open, evolving framework for integrative modeling and data management
Ongoing advances in scientific technology have left us confronted with the task of discovering scientific knowledge from enormous amounts of data generated in the life and environmental sciences, physics and sociology, business and medicine. In biology, these data are frequently combined from many sources. Further, addressing large-scale interdisciplinary problems requires diverse research teams. Researchers in biology, particularly in computational biology and systems biology, already have experienced the shift from the one-scientist-one-project paradigm to cross-disciplinary collaborative research, and the need for an infrastructure and community that can support it all.
Shifting roles and needs
It has become commonplace to hear about the massive amounts of heterogeneous data being generated, particularly in the biological sciences. Datasets are more often than not incompatible, incomplete and redundant, and have varying degrees of compliance with metadata standards. Along with data are provenance and workflows, which also must be captured to understand how an analysis was performed, to ensure comparability of results with other analyses, to know how to repeat analyses, and to determine how to build on top of analyses that might have been performed by someone else.
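To make the provenance point concrete, the short sketch below shows what a minimal provenance record attached to a derived dataset could capture. The field names and structure are illustrative assumptions, not a community or KBase standard.

```python
import json
from datetime import datetime, timezone

def provenance_record(inputs, tool, version, parameters):
    """Build a minimal provenance record for a derived dataset.

    The field names here are illustrative assumptions, not a fixed schema.
    """
    return {
        "inputs": inputs,            # identifiers of the source datasets
        "tool": tool,                # name of the analysis tool
        "tool_version": version,     # exact version used, so the run can be repeated
        "parameters": parameters,    # settings needed to reproduce the analysis
        "created": datetime.now(timezone.utc).isoformat(),
    }

# Example: record how an assembly was produced from two read sets
record = provenance_record(
    inputs=["reads_lane1.fastq", "reads_lane2.fastq"],
    tool="example-assembler",
    version="1.0.3",
    parameters={"k": 31, "min_contig_length": 500},
)
print(json.dumps(record, indent=2))
```

Capturing even this much alongside every analysis output makes it possible to compare, repeat and build on results produced by someone else.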
Data integration is likewise difficult, time-consuming and generally unpleasant. In biology, there are scores of data types, many different tools, and numerous analysis output formats that are incompatible with one another. Data manipulation to integrate heterogeneous data is a routine part of research for countless interdisciplinary investigators. In Jim Gray’s last talk in 2007, he observed, “When you go and look at what scientists are doing, day-in and day-out, in terms of data analysis, it is truly dreadful.”
This is especially true for projects not tapped into the large analysis pipelines that have traditionally been tied to the major data-generating centers, and ever more researchers are becoming disconnected from those pipelines. In the future, sequencing data will come from many distributed sources, not just from large-scale centers, and many of these new sources lack the infrastructure needed to conduct analyses at scale.
Until now, implementing new algorithms for analyses has required specialist know-how, sometimes a complete rewrite of the code or hiring someone new to do it, making forward-looking tool creation and maintenance particularly difficult for many biologists without access to data analysis pipelines. However, using pipelines in large data centers hasn’t been without issues; researchers desiring to depart from datacenter workflows for a step or two in order to use a different tool or algorithm have not been able to do so, as the tools have been locked into the pipeline. As the shift from data center to widespread data generation continues, we must confront an important question: Where and how will the community conduct data analyses?
Sharing
A growing fraction of the data is not quickly finding its way into the public archives. We must find ways to lower the barrier to sharing and collaboration in the community. Sharing must become easier. The research community will need to come together like never before, but how can this be done? We cannot scale up our questions without access to diverse data types and without connecting people with equally diverse research backgrounds and objectives. In addition to collaboration, there is another reason to share: cost reduction. Sharing pertains not only to data, but to data analyses, too. As the cost of data generation continues to drop rapidly, the volume of sequence data soars, and so does the amount of data interrogated to pursue research questions. Yet, analyses and transformations on large datasets are becoming prohibitively expensive. It can cost millions of dollars to apply methods developed for much smaller datasets to datasets that are orders of magnitude larger. The historic approach does not scale, and this is leading to an imbalance between the cost of collecting data and the cost of analyzing it.
For sharing to work, the community needs a structure that can
1. support data integration
2. allow the community to have continued access to a data pipeline that can accept their new heterogeneous data, or let them easily access existing contents of common databases
3. allow researchers to continue using their favorite tools, and to build and modify new tools without having to be a crack computer programmer
4. at the end of the day, have everything in a compatible format that can be securely shared, commented on, rated for quality, and built upon; and all of this must
5. drive us to create better models and predictions, and to conduct more appropriate, less redundant experimentation.
This is a lengthy and ambitious wish list. Yet, at a high level, this list represents our plan.
Crossing domains
New scientific questions are best answered by computing across domains (microbes and plants, microbes and communities, communities and plants). We might ask, for example: How can we produce a model of a whole cell that includes all the molecules required for life and their interactions? Or: How will an organism or a community respond to future environmental stress?
Diverse, interdisciplinary teams require a flexible system that is organized the same way they are. Every day, certain problems increase in scale; this is especially true of the newly emerging large-scale ecological surveys. Researchers have moved from the one-scientist-one-project paradigm, just as they have moved from the study-one-organism to the study-one-ecosystem paradigm. Questions now regularly span complex biological and environmental systems over many spatial and temporal scales, from molecular to global and from nanoseconds to centuries.
Considering science problems
To illustrate our approach, consider a few science problems and the corresponding community needs:
- Design an organism to produce new chemicals, such as those useful in the production of bioenergy and biofuels.
What would it take to do this? Researchers must reprogram and rewire biological systems by introducing new functionalities. The raw materials needed to accomplish this are accurate annotations, the reconstruction of metabolism and regulation, the integration of ‘omics data, and the construction of genome-scale models. But our ability to confidently assign structural and functional gene annotations has not kept pace with data generation. High-quality gene annotations with confidence measures are a critical component of all genome-scale modeling. Efforts to create genome-scale regulatory and metabolic models are held back by the poor quality of existing gene models, the lack of interoperability of models and data, the high degree of mathematical and computational expertise required to utilize models, and an inability to rapidly build new models that accurately capture the complete body of our current biological understanding. We need to be able to combine ‘omics data and modeling algorithms, making it possible to easily cross-validate data and models by comparing predictions with experimental observations.
- Discover valuable proteins in the environment, such as enzymes with novel properties.
Microbial diversity is a key element in this search. One way we could reveal novel proteins from microbial communities would be by integrating the functions from one tool (e.g., metaMicrobesOnline) with the data housed in another (e.g., the Metagenomics Rapid Annotation Using Subsystems Technology, or MG-RAST, server). This combination would permit deep comparative analysis of protein families and allow in-depth characterization of novel members of existing protein families and, eventually, the characterization of completely novel protein families.
- Identify genes that impact biofuel production.
In order to do this, we must build tools for data exploration, and link gene targets from phenotype studies, such as genome-wide association studies, with co-expression, protein-protein interactions and regulatory network models. This type of data exploration would allow users to narrow candidate gene lists by refining targets, or to visualize a sub-network of regulatory and physical interactions among the genes responsible for a phenotype. Users also would benefit from highlighting of networks or pathways impacted by genetic variation. Researchers need to be able to explore across multiple experiments and diverse data types. And they need access to comprehensive datasets from high-throughput experiments, together with relevant analytical tools and resources. A minimal sketch of this kind of candidate narrowing follows this list.
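The sketch below shows one way a candidate gene list from an association study could be narrowed using an interaction network. It uses the networkx library; the gene names, the toy network and the proximity rule are illustrative assumptions, not a KBase tool.

```python
import networkx as nx

# Hypothetical interaction network; in practice this would be assembled from
# co-expression, protein-protein interaction and regulatory network data.
network = nx.Graph()
network.add_edges_from([
    ("geneA", "geneB"), ("geneB", "geneC"),
    ("geneC", "geneD"), ("geneE", "geneF"),
])

# Hypothetical candidate genes from a genome-wide association study
candidates = {"geneA", "geneC", "geneE", "geneX"}

def narrow_candidates(graph, candidates, max_dist=2):
    """Keep candidates that lie within max_dist network steps of another candidate."""
    kept = set()
    for gene in candidates:
        if gene not in graph:
            continue  # no interaction evidence available for this gene
        nearby = nx.single_source_shortest_path_length(graph, gene, cutoff=max_dist)
        if any(other in nearby for other in candidates if other != gene):
            kept.add(gene)
    return kept

print(narrow_candidates(network, candidates))  # {'geneA', 'geneC'}
```

A real exploration tool would work over much larger networks and return the connecting sub-network for visualization, but the filtering idea is the same.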
Planning solutions
The solution currently being constructed is the Department of Energy (DOE) Systems Biology Knowledgebase, or KBase. It is a collaborative effort designed to accelerate our understanding of microbes, microbial communities and plants. It will be a community-driven, extensible and scalable open-source software framework and application system. KBase will offer free and open access to data models and simulations, enabling scientists and researchers to build new knowledge, test hypotheses, design experiments, and share their findings to accelerate the use of predictive biology.
The project has two central goals. The scientific goal is to produce predictive models, reference datasets and analytical tools and demonstrate their utility in DOE biological research relating to bioenergy, carbon cycle, and the study of subsurface microbial communities. The operational goal is to create the integrated software and hardware infrastructure needed to support the creation, maintenance and use of predictive models and methods in the study of microbes, microbial communities and plants.
The KBase will contain a data management system, called Shock, built to house large-scale sequence data sets and to provide an ecosystem supporting effective reuse of the results of expensive analyses. Shock is a hybrid of a NoSQL database integrated with a bulk storage system. At a high level, it provides a data catalog for large-scale data sets: it supports annotating data sets with metadata that describes both the data and the provenance of the collection and analysis processes that produced them. Beyond these basic functions, it also supports direct queries and server-side manipulation and reduction of data based on pre-computed feature location indices. Because the server provides scalable access to raw data, it also provides a basis for distributed analysis of sequence data. In short, the system will store, retrieve, query and filter large sets of scientific data.
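As a rough illustration of how a client might interact with such a data catalog, consider the sketch below. The service address, the /node endpoint and the form-field names follow the general pattern of a Shock-style REST interface but are assumptions for illustration; the actual interface is defined by the Shock documentation.

```python
import json
import requests

SHOCK_URL = "http://localhost:7445"  # placeholder address for a Shock-like service

def store_dataset(path, metadata):
    """Upload a file together with descriptive metadata to a node-based data catalog.

    The '/node' endpoint and form-field names are illustrative of a Shock-style
    REST interface, not a guaranteed contract.
    """
    with open(path, "rb") as handle:
        response = requests.post(
            f"{SHOCK_URL}/node",
            files={
                "upload": handle,
                "attributes": ("metadata.json", json.dumps(metadata)),
            },
        )
    response.raise_for_status()
    return response.json()

# Example: catalog a metagenomic read set along with where it came from
# and how it was processed (hypothetical metadata fields).
result = store_dataset(
    "sample_reads.fastq",
    {"project": "soil_survey", "sample": "plot_12", "pipeline": "qc-v0.9"},
)
print(result)
```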
Earlier, we mentioned three science problems. These problems, and input gathered from the community, have strongly influenced the design of the KBase, and have resulted in solutions that would benefit any member of the biological research community who wishes to use the KBase. The solutions we are offering the community include:
- Workflow engines that can use scalable cloud-hosted services
KBase will be supported by a computing infrastructure based on the OpenStack cloud system software, distributed across the “core” sites. Enabling cloud computing will create new opportunities that range from rapid deployment of developer environments to highly scalable production servers. This will permit bioinformaticists to chain together complex workflows to generate, summarize and integrate data that feed into biological models. KBase will offer a flexible infrastructure. Further, users can opt to download a portable, virtualized computing environment and data, and perform a data transformation using a user-defined workflow, selecting from both established tools and new tools built by the community. A minimal sketch of such workflow chaining appears below.
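The sketch composes a workflow from ordinary Python functions whose outputs feed the next step. The step names and placeholder logic are assumptions for illustration; a production workflow engine would dispatch each step to cloud-hosted services and record provenance along the way.

```python
def quality_control(reads):
    """Drop short reads (placeholder for a real quality-control step)."""
    return [r for r in reads if len(r) >= 50]

def assemble(reads):
    """Pretend to assemble reads into contigs (placeholder logic)."""
    return ["".join(reads)] if reads else []

def annotate(contigs):
    """Pretend to call genes on each contig (placeholder logic)."""
    return [{"contig": c, "genes": max(1, len(c) // 1000)} for c in contigs]

def run_workflow(data, steps):
    """Chain steps so that each one consumes the previous step's output."""
    for step in steps:
        data = step(data)
    return data

reads = ["A" * 75, "C" * 40, "G" * 120]  # toy input data
result = run_workflow(reads, [quality_control, assemble, annotate])
print(result)
```

The same pattern scales up when each step is a call to a remote, cloud-hosted service rather than a local function.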
- Structured and unstructured data integrated into a coherent data storage model (relational and NoSQL databases and bulk store)
KBase will store a diverse representation of biological data ranging from highly structured data in relational databases, to large bulk data, to frequently generated and changing user data. This entails multiple approaches. Carefully planned storage solutions include a Bulk Distributed File Storage (ADM), a Persistent Store (Users), and a Central Data Model (CDM). These stores serve as a connecting layer between the core services (popular and established databases and tools) and the KBase Unified Application Programming Interface (API), which is the primary programming target for KBase tools and applications, and which provides integration services, low-level system services, and user interface and graphics services. A rough sketch of how such a layered store might route data follows.
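In the sketch, one thin layer sends each kind of data to the backend best suited to it. The class names and the routing rule are hypothetical placeholders standing in for relational, NoSQL and bulk file storage, not the actual KBase stores.

```python
class RelationalStore:
    """Placeholder for curated, highly structured records in a relational database."""
    def save(self, key, value):
        print(f"relational: INSERT {key}")

class DocumentStore:
    """Placeholder for frequently changing user data in a NoSQL document store."""
    def save(self, key, value):
        print(f"document: PUT {key}")

class BulkFileStore:
    """Placeholder for large raw datasets (e.g., sequence reads) in bulk storage."""
    def save(self, key, value):
        print(f"bulk: write {key} ({len(value)} bytes)")

class DataLayer:
    """Route each dataset to the appropriate backend by kind."""
    def __init__(self):
        self.backends = {
            "structured": RelationalStore(),
            "user": DocumentStore(),
            "bulk": BulkFileStore(),
        }

    def save(self, kind, key, value):
        self.backends[kind].save(key, value)

layer = DataLayer()
layer.save("structured", "genome:example", {"name": "Escherichia coli K-12"})
layer.save("bulk", "reads:plot_12", b"ACGT" * 1000)
```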
- A unified application programming interface enabling cross-coupling and reuse of tools and algorithms
KBase will be composed of a series of core biological analysis and modeling functions, including an API that can be used to connect different software programs within the community. These capabilities will be constructed from the popular analysis systems at each of the KBase sites. Their integration into KBase will combine individual functions to create the next generation of biological models and analysis tools.
The KBase Unified API also will enable third-party researchers from our diverse community of users to design new functions. Its development is based on a service-oriented approach to deliver both functionality and data to the community. In this type of approach, the system is functionally decomposed into services, each of which is implemented as one or more servers.
The API provides a single view to all of the capabilities of KBase. Initially, these services will be developed by the KBase infrastructure team and will support a long-term goal of community-developed and contributed services. Our initial set of services will be backed by many servers: genomic, expression, protein family, polymorphism, phenotype, compound and reaction, metabolic models and regulatory models.
We have initially targeted applications that will have eventual applicability across all KBase science domains. We will have a common software infrastructure that is driven by the science, shared hardware sized to the data production rates of the community, and data and computing resources based on the needs of the science domains. We will establish common modes for drawing data from, and depositing data into, established archives as needed. The interface also will serve as a critical link that the community can use to access planned beta, version one, and future community-developed tools that add value (i.e., predictions) to their data and science questions.
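To illustrate the service-oriented pattern described above, the sketch below shows a thin client that presents a single interface while dispatching calls to separate per-capability services. The endpoints, service names and identifiers are assumptions for illustration, not the actual KBase API.

```python
import requests

# Hypothetical endpoints; in a service-oriented design each capability is
# backed by its own server but exposed through one client-facing API.
SERVICE_ENDPOINTS = {
    "genomes": "http://genomes.example.org/api",
    "expression": "http://expression.example.org/api",
    "models": "http://models.example.org/api",
}

class UnifiedClient:
    """Single entry point that dispatches calls to per-capability services."""

    def __init__(self, endpoints=None):
        self.endpoints = endpoints or SERVICE_ENDPOINTS

    def call(self, service, method, **params):
        url = f"{self.endpoints[service]}/{method}"
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()

# Example usage (method names and identifiers are illustrative):
client = UnifiedClient()
# genome = client.call("genomes", "get_genome", id="example_genome")
# profile = client.call("expression", "get_expression", genome_id="example_genome")
```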
The KBase environment provides a seamless presentation to the user. While we are leveraging the considerable work of many scientists and developers for the core services on which the KBase is built, the overarching objective is to provide a solid platform that supports predictive biology in a framework that does not require users to learn separate and varied systems in order to compose and execute the complex tasks that constitute many of the future workflows in systems biology.
What lies ahead
Recently, many have issued a call to action for the biological community to regroup. Much has been said and written about how arduous the path will be, as it requires a huge, diverse community to come together, and even, in some instances, to rethink how we are conducting and defining science. We are cautious but hopeful in laying out our ambitious plan.
Daniel Burnham said, “Make no little plans…” We agree.
It has become clear that we will need to develop a rich, highly connected culture to face these problems and the shifting values in our ecosystem. Initially, many within this new culture can mobilize and help to build an infrastructure capable of supporting the newly emerging paradigm. We suggest considering KBase as that place to begin.
Jennifer Fessler Salazar is a science writer at Argonne National Laboratory and a Research Associate with the Field Museum’s Department of Zoology. Narayan Desai is a principal experimental systems engineer at Argonne National Laboratory. Folker Meyer is a computational biologist at Argonne National Laboratory, a senior fellow at the Computation Institute at the University of Chicago, and an associate division director of the Institute of Genomics and Systems Biology. They may be reached at
editor@ScientificComputing.com.