CAMBRIDGE, MA — Intel and the Broad Institute of MIT and Harvard announced at the Bio-IT World Conference & Expo that they are co-developing new tools, and advancing fundamental capabilities, so large genomic workflows can run at cloud scale.
The Broad Institute also announced collaborations with cloud providers to enable cloud-based access to its Genome Analysis Toolkit (GATK) software package, which is expected to expand access to the GATK Best Practices pipeline. The new tools Broad is developing with Intel aim to simplify the execution of large genomic workflows such as GATK, and to improve the storage, scalability, and processing of genomic data. These improvements have the potential not only to speed variant detection and biomarker discovery, but also to enable discoveries that would go undetected in smaller cohorts.
Broad’s workflow execution engine, called “Cromwell,” is designed to launch genomic pipelines on private or public clouds in a portable and reproducible manner. Broad is working with Intel to extend Cromwell’s capabilities to support multiple input languages and execute on multiple back ends simultaneously, enabling researchers to run jobs anywhere.
This integrated workflow engine has built-in intelligence capable of finding the optimal way to execute tasks, the most appropriate hardware resources to run those tasks on, and methods to avoid redundant steps. “Orchestrating genomic workflows at cloud scale is complex,” said Dr. Eric Banks, Senior Director of Data Sciences and Data Engineering at Broad and a creator of the GATK software package. “We wanted to simplify the execution of common genomic data types like reads and variants and to create an environment that allows any researcher to do this at scale in an easy-to-use way.”
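One way to picture "avoiding redundant steps" is result caching: if a task has already run with identical inputs, its earlier output is reused instead of being recomputed. The sketch below is a toy illustration of that general idea only. It is not Cromwell's implementation, and the names `CachingRunner` and `align` are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class CachingRunner:
    """Hypothetical runner that skips a task if it already ran with the same inputs."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}
        self.executions = 0  # how many tasks were actually executed

    @staticmethod
    def _key(task_name: str, inputs: Dict[str, str]) -> str:
        # Hash the task name plus its inputs; identical calls yield identical keys.
        payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_name: str, inputs: Dict[str, str],
            fn: Callable[[Dict[str, str]], str]) -> str:
        key = self._key(task_name, inputs)
        if key not in self._cache:
            # Cache miss: execute the task and remember its result.
            self.executions += 1
            self._cache[key] = fn(inputs)
        return self._cache[key]


def align(inputs: Dict[str, str]) -> str:
    # Stand-in for an expensive pipeline step (assumption, for illustration).
    return f"aligned({inputs['sample']})"


runner = CachingRunner()
first = runner.run("align", {"sample": "S1"}, align)
second = runner.run("align", {"sample": "S1"}, align)  # reused, not re-executed
third = runner.run("align", {"sample": "S2"}, align)   # new input, executed
```

In this sketch three calls are requested but only two are executed, because the repeated call for sample `S1` is served from the cache.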
Another area of joint innovation is the processing and storage of genomic variant datasets, which often consist of large, sparse data matrices. Gene sequence variation data is commonly stored as text files for bioinformatics. The declining cost of DNA sequencing has driven an increase in the volume of genomic datasets that researchers want to incorporate, making it increasingly difficult to jointly analyze large volumes of data from text files. Large-scale reads and writes of variant call data, joint genotyping, and variant recalibration require next-generation databases that are built and optimized for genomic data.
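To make "large, sparse data matrices" concrete: in a samples-by-positions grid, most cells match the reference genome, so only a small fraction of cells carry variant information. The following minimal sketch stores only those non-reference cells. It is a toy illustration, not the GenomicsDB or TileDB API; the encoding (0 for reference, 1 or 2 for the count of alternate alleles) and the function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Coordinate = Tuple[int, int]  # (sample_index, position_index)


def to_sparse(dense: List[List[int]]) -> Dict[Coordinate, int]:
    """Keep only non-reference cells, keyed by their coordinates."""
    return {
        (row, col): value
        for row, values in enumerate(dense)
        for col, value in enumerate(values)
        if value != 0
    }


def lookup(sparse: Dict[Coordinate, int], sample: int, position: int) -> int:
    """Cells missing from the sparse store are treated as reference (0)."""
    return sparse.get((sample, position), 0)


# Hypothetical cohort: 3 samples x 8 positions.
dense = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]

sparse = to_sparse(dense)
total_cells = sum(len(row) for row in dense)
stored_cells = len(sparse)
```

Here only 3 of the 24 cells are stored. The savings grow with cohort size, since each added sample contributes mostly reference cells.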
Broad and Intel are collaborating on a faster, more flexible, and scalable solution. ‘GenomicsDB’ is a novel way to store vast amounts of patient variant data and to perform fast processing with unprecedented scalability. Built and optimized for the management of genomic variant data, GenomicsDB runs on top of an array database system optimized for sparse data called ‘TileDB.’
TileDB was developed by MIT and Intel researchers working at the Intel Science and Technology Center for Big Data, which is based at MIT’s Computer Science and Artificial Intelligence Lab. GenomicsDB is now used in Broad’s production pipeline, running on an Intel Xeon processor-based cloud environment, to perform joint genotyping.
“The time it now takes to perform the variant discovery process went from eight days to 18 hours,” Banks said. “However, that’s with 100 whole genomes. We routinely process projects with thousands of samples, so that speedup itself is truly transformative. We recently needed to abandon our attempt to run variant discovery on an eight thousand sample project, because we estimated it would take 90 days without GenomicsDB. With GenomicsDB, however, it should take under a week. This means we can say ‘yes’ to our researchers far more often, on far more ambitious projects.”
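The figures in the quote can be turned into rough speedup factors. The short calculation below uses only the numbers quoted above. Treating "under a week" as exactly seven days is an assumption, so the second figure is a lower bound.

```python
HOURS_PER_DAY = 24


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of the old running time to the new running time."""
    return before_hours / after_hours


# 100 whole genomes: eight days before, 18 hours after.
small_cohort = speedup(8 * HOURS_PER_DAY, 18)

# Eight thousand samples: an estimated 90 days before, "under a week" after.
# Using a full 7 days makes this a lower bound on the speedup (assumption).
large_cohort_lower_bound = speedup(90 * HOURS_PER_DAY, 7 * HOURS_PER_DAY)

print(f"100 genomes: about {small_cohort:.1f}x faster")
print(f"8,000 samples: at least {large_cohort_lower_bound:.1f}x faster")
```

Both cases work out to a speedup of roughly an order of magnitude.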
“With the integration of these two tools into the genomic pipeline that we are running on a cloud environment, the orchestration and execution of the workflow is not only simplified but significantly accelerated,” said Ben Neale, an institute member at the Broad Institute’s Stanley Center for Psychiatric Research and the Broad’s Program in Medical and Population Genetics. “We are excited that the research community will be able to start testing GenomicsDB and Cromwell.” Intel is releasing TileDB and GenomicsDB as open-source tools.
Engineers building the ‘Collaborative Cancer Cloud,’ a precision medicine network that includes Oregon Health & Science University (OHSU), Dana-Farber Cancer Institute (DFCI), and Ontario Institute for Cancer Research (OICR), are already using these tools across their collective datasets. The long-term goal is to extend these tools to enable joint genotyping with other large genomic research centers in a federated and secure model, regardless of where the data resides.
Broad will continue to work with Intel on next-generation computing technologies that address the size, speed, security, and scalability challenges associated with large-scale genomic sequencing data and analytics. “The progress that we’re seeing in our development work with Broad represents another step in the moonshot goal of taming cancer and other maladies,” said Eric Dishman, Intel Vice President, Health and Life Sciences. “Harnessing and analyzing massive amounts of genomic data may eventually be a key factor in enabling people around the world to live longer, healthier lives.”