Data-driven Science: Supercomputing in the World of Biology
Researchers examine high-resolution screen displays of molecules known as messenger RNAs, which carry genetic information copied from DNA, at the iPlant Developer’s Summit at the University of Texas at Austin in September 2009. Courtesy of the Texas Advanced Computing Center
Biologists, computer programmers and information scientists at the UA’s BIO5 Institute have created a cyberinfrastructure that enables biological sciences researchers to process immense data sets. What began as an initiative to give plant biologists the computing power needed to analyze the extremely large data sets that were newly permeating biology has expanded to include projects and scientists from all fields of biology and the biomedical sciences.
The University of Arizona leads the iPlant Collaborative, which is based at the UA’s BIO5 Institute. Collaborating institutions include the Texas Advanced Computing Center, the University of Texas at Austin, Cold Spring Harbor Laboratory in New York, the University of North Carolina at Wilmington and Purdue University.
“iPlant is empowering people to use high-performance computing to analyze very large data sets,” said Stephen Goff, the principal investigator and project director of the iPlant Collaborative. “A big function of iPlant is to bring together high-performance computing experts, build a cyberinfrastructure platform, and use it to advance life science research. Life science is more of a data-driven science now than it has been in the past.”
Said Eric Lyons, a senior computational biologist working with iPlant: “We’re at this interesting revolution with biology where we’ve become a much more quantitative science.”
“And we’re now able to generate gigabytes, if not terabytes of data, really easily. In order to get through that much data, you need a lot of computational resources. iPlant is the first major investment by the National Science Foundation in order to build cyberinfrastructure for biologists to allow researchers to handle and cope with all of this information.”
The five-year project received $50 million from the NSF in 2008. The iPlant team engaged with plant scientists from across the nation, as well as with some international scientists, to find out about their research computing needs: what types of data sets they work with and what kinds of questions they ask.
The iPlant team then used this information to develop the software and projects that would be most useful in helping scientists store and process their data.
iPlant is about building cyberinfrastructure for life sciences, said Lyons: “Cyberinfrastructure is the essential ‘plumbing’ that we need in order to hook together different kinds of computational resources to make it easy for biologists to manage large amounts of data in terms of getting it someplace, keeping it organized, sharing it with their collaborators, and analyzing it to make scientific discoveries.”
Traditionally, supercomputing centers have worked mainly with the physical sciences: modeling and dynamics of fluids, ocean currents, atmosphere, climate change, geological processes like earthquakes and fault-line stresses.
“They have advanced visualizations to deal with massive data sets, and now computer science experts are working together with biologists to develop software that makes it easy for biologists to store and analyze their data,” Lyons said.
At the UA, a core group of software developers and computer science system engineers works in collaboration with engineers and scientists at XSEDE (Extreme Science and Engineering Discovery Environment), an NSF-funded supercomputing center that deals specifically with processing large amounts of data.
iPlant has developed ways for scientists to share information over the Internet, both through software and through virtual servers created to store huge amounts of data.
“Let’s say, for example, that you’re a scientist and you’re working at a site that manages a huge amount of data,” said Lyons. “And you want to take a portion of your data and easily send it somewhere else to be processed, add more data to it and have it sent back, but have it all happen automatically. At iPlant, we leverage all the different technologies that we have, and evaluate which new computational technologies are going to help solve particular problems.”
Said Matthew Helmke, who is a senior technical documentation specialist for iPlant: “We’re also offering novel ways to interact with iPlant’s computational systems.”
“People can write their own software to interact with our systems. So, you can go from: ‘I’m scared of the computer, but I’m a biologist and I have data and I want to do something with it, can you help me?’ to the other extreme of: ‘I’ve been doing computational analysis since the 1980s; you just have really cool resources that I don’t have, can I tap into those resources with my own analysis programs?’ And I think we will make people happy coming from both extremes.”
One of the flagship products is the iPlant Discovery Environment, which is a Web interface.
“Let’s say you have something that compares evolutionary relationships or traits with a common ancestral origin,” Helmke said. “You can run the appropriate analysis on your own desktop system, and it will work reasonably well, but the analysis might take a considerable amount of time to complete. What if you could run the exact same program with the exact same parameters but have it leverage high performance computing resources, and still do it from your desktop using a Web interface? That’s what the iPlant Discovery Environment allows researchers to do.”
The iPlant Discovery Environment gives researchers a way to store data, add new software that they design for their experiments, and collaborate with other scientists using that new software. They also can allow other scientists to use their software in the future.
“Once the data is in the public iPlant Data Store, and the software tools are made available in the iPlant Discovery Environment, they become available for other researchers to use,” said Helmke. “You can also keep parts or all of your data and analysis tools confidential. Your data can be kept completely private or you may share it as you like, but if someone wants to replicate your experiment, how do they do that? The iPlant Discovery Environment and Data Store allows other researchers to use the exact same environment to replicate specific experiments.”
One project iPlant is working on is the Taxonomic Name Resolution Service, or TNRS, a software system that compiles different classification schemes and creates links between them, so that scientists can look up an organism under any of the schemes in which it is classified. This is important because many entries in collections can have multiple names, and the TNRS helps resolve naming discrepancies.
Said Shannon Oliver, a technical documentation specialist for iPlant: “It’s really hard to push forward collaborative research without having some sort of standardization across the different studies. How do you collaborate when everyone’s using a different name for different plants and how do you know that the data is still applicable across these different species?”
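The idea behind resolving naming discrepancies can be illustrated with a minimal Python sketch. This is not the TNRS implementation: the lookup table, the function name resolve_name and the similarity cutoff are all assumptions made for illustration. The sketch maps each name variant to a single accepted name and falls back to approximate matching to catch small spelling errors.

```python
import difflib

# Illustrative table only (not taken from the TNRS): maps lowercase name
# variants to one accepted name. A real service would draw on curated
# taxonomic databases rather than a hand-written dictionary.
ACCEPTED_NAME = {
    "solanum lycopersicum": "Solanum lycopersicum",
    "lycopersicon esculentum": "Solanum lycopersicum",  # older name for tomato
    "zea mays": "Zea mays",
}


def resolve_name(raw_name, cutoff=0.85):
    """Return the accepted name for raw_name, or None if nothing is close."""
    # Normalize case and collapse repeated whitespace before looking up.
    key = " ".join(raw_name.lower().split())
    if key in ACCEPTED_NAME:
        return ACCEPTED_NAME[key]
    # Fall back to approximate string matching for small spelling errors.
    close = difflib.get_close_matches(key, list(ACCEPTED_NAME), n=1, cutoff=cutoff)
    return ACCEPTED_NAME[close[0]] if close else None


if __name__ == "__main__":
    for name in ["Lycopersicon esculentum", "Zea mayz", "Quercus alba"]:
        print(name, "->", resolve_name(name))
```

With a shared table of this kind, two studies that record the same plant under different names can be linked to one accepted name before their data are compared.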
The iPlant Collaborative projects also have major educational components, funded by the NSF, BIO5 and Science Foundation Arizona. Students and teachers can access the iPlant technology resources and open-source data, and there are educational tools designed to help K-12, undergraduate and graduate students understand the data, as well as how to use the computational tools provided by iPlant.
The main component of iPlant is its collaborative nature, said Goff: “It’s not about one person’s research or even a small group’s research. It’s collaborative across plant biology disciplines, ecology, functional genomics, molecular genetics and evolution. There are many problems in biology that are beyond the scope of a single research lab, but within scope for multiple labs that have different levels of expertise in specific areas. iPlant is designed to empower collaborations across disciplines and facilitate major discoveries.”