Unlocking the Ocean’s Genomic Secrets
Efficient database solution helps J. Craig Venter Institute tap big data analytic power without big price tag
J. Craig Venter Institute (JCVI), a not-for-profit scientific organization, is a world leader in genomic research. JCVI is renowned for the work of Dr. J. Craig Venter, founder and president, and his team in decoding the first draft of the human genome. Today, JCVI’s research spans a number of innovative projects, including the Sorcerer II Global Ocean Sampling Expedition (GOS),¹ which has already uncovered more than 60 million genes and thousands of novel protein families from microorganisms found in seawater.
As part of the GOS Expedition, JCVI researchers have traveled around the world, collecting samples from diverse locations, such as deep-sea vents; the Atlantic, Pacific, Baltic, Black and Mediterranean Seas; and Antarctica. (Related research is also collecting samples from the air, soil and the human mouth and gut.) These samples are then analyzed against a database that contains previously decoded microbial genomes. Through this process, many new organisms are being identified, and their genomic information has become part of an ever-expanding database of knowledge about the microbial world. JCVI is using this information not only to better understand the evolution of the oceans and our world, but also to advance existing knowledge of microbial biodiversity, human health and genetic diseases, climate and environmental issues, and even agricultural innovation.
Data that multiplies, and multiplies, and multiplies
At the heart of the GOS Expedition is an enormous volume of genomic data. The team at JCVI uses a metagenomic approach, in which DNA analysis is conducted on the entire community of microbes within a sample rather than on individual microbial species that have been isolated and cultured one at a time.
First, seawater is collected and passed through a series of progressively smaller filters that capture microorganisms of various sizes (single-cell marine organisms, bacteria and viruses, for example). The samples are frozen and shipped to JCVI laboratories, where scientists extract the genomic DNA of the microorganisms and then analyze the sequences against databases of known organisms to look for similarities or patterns. Based on the findings, additional rounds of analysis are done to learn more. For example, scientists might compute nucleotide frequencies in the sequences and use the results to associate one of the samples gathered with a specific group of known organisms or proteins, as in the sketch below. Or they might compare sequences from samples taken from one location or time period with the sequences of samples taken from another to better understand the impact that environmental conditions have on the microorganisms living within them.
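To make the nucleotide-frequency idea concrete, here is a minimal Python sketch, not JCVI’s actual pipeline: it builds trinucleotide (3-mer) frequency profiles for DNA sequences and assigns an unknown read to whichever reference group has the nearest profile. The sequences and group names are invented for illustration.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3):
    """Return normalized frequencies for every canonical DNA k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return {kmer: counts.get(kmer, 0) / total
            for kmer in ("".join(p) for p in product("ACGT", repeat=k))}

def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))

# Hypothetical reference profiles for two known organism groups; a real
# pipeline would build these from curated reference sequences.
references = {
    "group_A": kmer_profile("ACGTACGTTGCAACGTACGTTGCA"),
    "group_B": kmer_profile("GGCCTTAAGGCCTTAAGGCCTTAA"),
}
unknown = kmer_profile("ACGTACGTTGCAACGTTGCAACGT")
best = min(references, key=lambda g: profile_distance(unknown, references[g]))
print("Closest reference group:", best)
```

Real analyses use far richer statistics and reference databases, but the basic move is the same: turn each sequence into a numeric signature and compare signatures.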
What this means from a technical perspective is that JCVI must capture, store and analyze very large data sets — data sets that keep expanding in size as more research and analysis is done. While initial analytic data loads may start out relatively small (in the 20 to 100 gigabyte range), secondary processing and analysis can cause the data size to blow up to terabyte levels and beyond. In addition, advances in sequencing technology have enabled researchers to capture much more genomic data overall. When the Expedition first started in 2003, a data set that included 50,000 sequences was considered a large amount. Today, a typical round of analysis may include 40 to 50 million sequences.
Due to the growing volume of data involved, JCVI faced the problem of efficiently and economically storing, loading and analyzing it all with the project’s existing MySQL databases. As the GOS Expedition progressed and the size of the data continued to multiply, analytic query speed began to suffer. Scientists found themselves waiting hours, and sometimes even days, for results, and JCVI’s database administrator was spending a great deal of time manually indexing and partitioning data (which further increased the data footprint) and customizing queries so that analysis could be performed at all.
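As an illustration of that hand tuning, the short Python sketch below prints the kind of MySQL index and partition DDL a database administrator might issue; the table and column names (sequence_hits, sample_id, sample_year) are hypothetical, not JCVI’s actual schema.

```python
# Hypothetical examples of the manual MySQL tuning described above;
# all schema names are invented for illustration.
TUNING_STATEMENTS = [
    # A secondary index so per-sample lookups avoid full-table scans
    # (each index also consumes additional disk space).
    "CREATE INDEX idx_sample ON sequence_hits (sample_id)",
    # Range partitioning by year so queries can prune old partitions.
    """ALTER TABLE sequence_hits PARTITION BY RANGE (sample_year) (
           PARTITION p2003 VALUES LESS THAN (2004),
           PARTITION p2007 VALUES LESS THAN (2008),
           PARTITION pmax  VALUES LESS THAN MAXVALUE)""",
]

# In production these would be executed through a MySQL client
# (e.g. mysql-connector-python); here we simply print them.
for stmt in TUNING_STATEMENTS:
    print(stmt + ";")
```

Each new query pattern tends to demand another index or partition scheme, which is how this kind of tuning becomes a standing tax on the administrator’s time.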
JCVI determined that it needed to find a data storage and analysis solution that could easily scale up as scientists generated more and more research, as well as one that would not require costly administrative support from IT or expensive hardware upgrades. There was one question remaining, however: did such a solution even exist?
Big data analytic power without a big price tag
All types of organizations today are being challenged by unprecedented data growth and, as a result, need to rethink their information strategies to accommodate the storage, processing and analytic challenges brought on by “Big Data.” This is especially true in the R&D community, and particularly in genomics, where scientific and technological advances have enabled organizations to generate knowledge at a much faster rate and in far greater volume than ever before.
Adding hardware capacity is one way to support large data loads, but purchasing more and more servers and disk storage subsystems can get expensive, even for companies with substantial IT budgets. As a not-for-profit research institution, JCVI needs to keep project costs in check, so high-priced database technology and expensive infrastructure are simply not an option. This is what led the organization to MySQL, an open source and hence extremely affordable row-oriented database popular in the biological community.
The problem with row-oriented databases, whether open source or commercially licensed, is that they eventually run out of analytic steam as data size and diversity increase. This is mainly because all of the columns associated with each row being analyzed must be accessed when processing a query, even data that is not relevant to the question being asked. As a result, analysis can slow to a crawl (or hit a wall completely), and database administrators need to be called in to index, partition or otherwise manually “tune” the data set to improve analytic performance. Eventually, all the manual configuration required becomes a drag on resources, a reality that JCVI was experiencing as the GOS Expedition evolved.
To get big data-sized analytic power without the big price tag, the project team at JCVI began to look at columnar database technologies, which offer a different and more efficient analytic approach.
Changing the analytic perspective
Columnar databases, which have emerged as an alternative to row databases in recent years, store data column-by-column rather than row-by-row. This simple change in perspective (looking down rather than across) means that much less data needs to be retrieved to answer a query, because most analytic queries involve only a subset of the columns in a table. This is a key reason why JCVI wanted to transition to a columnar database solution: it would speed query response times and, thus, research progress, even as data volumes continued to increase. The toy model below illustrates the difference between the two layouts.
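The following Python model uses synthetic data (this is not how any particular column store is implemented): the same records are held row-wise and column-wise, and a query that needs only two of the four fields touches far fewer bytes in the columnar layout.

```python
import random
import sys

random.seed(0)
N = 100_000

# Row-oriented layout: each record keeps all four fields together.
rows = [(i,                                                   # record id
         "".join(random.choice("ACGT") for _ in range(50)),   # wide sequence field
         random.random(),                                     # similarity score
         random.randrange(2003, 2011))                        # sampling year
        for i in range(N)]

# Column-oriented layout: one array per field.
cols = {name: [r[j] for r in rows]
        for j, name in enumerate(["id", "seq", "score", "year"])}

# Query: average score for samples from 2007. The columnar layout reads
# only the 'score' and 'year' columns; a row store must fetch whole
# records, including the wide 'seq' field the query never uses.
hits = [s for s, y in zip(cols["score"], cols["year"]) if y == 2007]
print("average 2007 score:", sum(hits) / len(hits))

row_bytes = sum(sys.getsizeof(f) for r in rows for f in r)
col_bytes = (sum(sys.getsizeof(v) for v in cols["score"]) +
             sum(sys.getsizeof(v) for v in cols["year"]))
print(f"bytes touched, row scan:    {row_bytes:,}")
print(f"bytes touched, column scan: {col_bytes:,}")
```

The columnar scan touches a small fraction of the bytes the row scan does, which is the same effect that lets a column store answer narrow analytic queries quickly without hand-built indexes.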
The GOS Expedition team considered the columnar options available and chose an analytic database developed by Infobright. Infobright’s solution was attractive for several reasons. First, because Infobright provides a free-to-download open source edition, JCVI could try out the database at no cost. (The organization has since upgraded to Infobright’s Enterprise Edition, which includes additional features, higher performance and product support, at a fraction of the cost of more traditional options.) The solution’s innovative data compression capabilities were also compelling, as JCVI wanted to be able to analyze more data without having to invest in or maintain additional hardware and storage. Finally, the Infobright database is designed to integrate seamlessly with MySQL, which meant that the solution would work well within JCVI’s existing information infrastructure.
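Because Infobright presents a MySQL-compatible interface, loading data looks like an ordinary MySQL bulk load. The sketch below uses the mysql-connector-python client; the host, credentials, file path and table name are placeholders, not JCVI’s actual configuration.

```python
# A minimal load sketch, assuming a server reachable over the MySQL
# protocol; every name and credential below is a placeholder.
import mysql.connector

conn = mysql.connector.connect(
    host="analytics-db.example.org",
    user="loader",
    password="********",
    database="gos",
    allow_local_infile=True,  # required for LOAD DATA LOCAL INFILE
)
cur = conn.cursor()

# Bulk-load a tab-delimited export straight into the analytic table;
# no indexes or partitions have to be created beforehand.
cur.execute(r"""
    LOAD DATA LOCAL INFILE '/data/exports/sequence_hits.tsv'
    INTO TABLE sequence_hits
    FIELDS TERMINATED BY '\t'
""")
conn.commit()
cur.close()
conn.close()
```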
With the new database solution in place, queries on genomic data that used to take many minutes (and sometimes hours) to resolve now come back in seconds, resulting in a nearly 10-fold improvement in analytic performance. JCVI also has achieved data compression ratios of 10:1, and in many cases up to 14:1, allowing the organization to speed analysis even more while cutting down on storage costs. (As an example, one database that occupied 433 gigabytes in MySQL now takes up only 25 gigabytes in Infobright.) And, from an administrative perspective, the deployment has been a lifesaver. JCVI’s database administrator used to spend dozens of hours per week tinkering with MySQL queries and creating indexes and partitions to ensure that analysis could be performed. Now, data simply needs to be loaded into Infobright, with no customization required. Ultimately, this means that JCVI scientists analyzing data from the GOS Expedition can ask more questions and change analytic parameters on the fly, because they are no longer limited by the bottlenecks created when queries became too complicated or required too much setup time to perform.
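Much of that compression comes from column homogeneity: values within one column resemble each other far more than the mixed fields of a row do, so a general-purpose compressor does better on column-grouped data. A quick illustration with Python’s zlib on synthetic data (Infobright’s actual compression algorithms are its own):

```python
import random
import zlib

random.seed(1)
n = 50_000
years  = [str(random.randrange(2003, 2011)) for _ in range(n)]
scores = [f"{random.random():.3f}" for _ in range(n)]
seqs   = ["".join(random.choice("ACGT") for _ in range(20)) for _ in range(n)]

# Row layout interleaves unlike values; column layout groups like values.
row_blob = "\n".join(",".join(t) for t in zip(years, scores, seqs)).encode()
col_blob = "\n".join(years + scores + seqs).encode()

for name, blob in [("row-oriented", row_blob), ("column-oriented", col_blob)]:
    ratio = len(blob) / len(zlib.compress(blob, 9))
    print(f"{name:15} compression ratio: {ratio:.1f}:1")
```

The column-grouped blob compresses noticeably better even with a generic compressor; a column store that adds type-specific encodings per column can push the ratio much further.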
More questions lead to more knowledge
At the end of the day, the ability to ask more questions is the driving force behind any research endeavor. The GOS Expedition has discovered scores of novel genetic sequences and, by sharing its findings with research collaborators from other institutions, has contributed a great deal to the greater scientific community’s knowledge base. Backed by a scalable and efficient database specifically designed for high-volume analytics, JCVI scientists can keep sifting through a sea of genomic data to find more answers and insights about the oceans and our world.
References
1. Expedition overview: http://www.jcvi.org/cms/research/projects/gos/overview/
2. Family breakdown for Volvox_carteri: http://www.jcvi.org/apis/phytax_apis/Volvox_carteri/family
3. A category breakdown: http://www.jcvi.org/apis/phytax_apis/Volvox_carteri/jgi_103424_7099.m000123-Volca1/pdf
Michael Heaney is Database Manager at the J. Craig Venter Institute. He may be reached at editor@ScientificComputing.com.