In late 2012, the Joint High-Performance Computing Exchange (JHPCE) in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health hit a wall.
Genomics represents a large share of its workload. Prior to 2007, the cost of genomic sequencing had been scaling with Moore’s Law, so computing and storage costs had kept pace with the flow of data. But the advent of next-generation sequencing technologies supercharged the scaling law for sequencing (see http://www.genome.gov/sequencingcosts/), so that by 2012 the center was drowning in data.
To rebalance the infrastructure portfolio, investment needed to shift from computing to storage. Unfortunately, the cost of petabyte-scale storage was prohibitive. No vendor could be found selling NFS storage for less than about $300/TB, and scale-out storage ran more than $1,000/TB. With the center’s storage needs doubling every year and NIH budgets essentially flat, the situation looked grim.
To address the crisis, Dr. Fernando Pineda, Associate Professor of Molecular Microbiology & Immunology and director of JHPCE, developed an innovative storage strategy. “Enterprise storage systems are over-designed,” he says. “They have too much redundancy and too many bells and whistles. We were forced to build our own storage systems. Conventional wisdom says that DIY storage is too hard, too risky, and too scary, but back-of-the-envelope cost and performance calculations were compelling and justified the risk.”
His team’s first effort, which went into production in 2013, consisted of a 688 TB NFS network-attached storage (NAS) system that used 5,400 RPM small-NAS SATA drives and ZFS-on-Linux. It was a hit with the investigators. This was followed by a more ambitious Lustre-over-ZFS project. The Lustre project stretched the expertise of the Johns Hopkins group, so Pineda assembled a team: BioTeam to help architect the system, Silicon Mechanics to integrate the commodity hardware, and Intel to support Intel Enterprise Edition for Lustre (IEEL).
“Conventional wisdom says that DIY storage is too hard, too risky, and too scary, but back-of-the-envelope cost and performance calculations were compelling and justified the risk.” – Dr. Fernando Pineda
The initial 1.2 PB system that the team developed came in at a stunningly low $137 per TB of usable storage. The system, designated Dirt-Cheap-Lustre (DCL), has two Object Storage Servers and has been in production for six months. The cost is roughly 75 percent less than that of commercially available parallel file systems. Remarkably, over a five-year period, DCL is less expensive than cloud-based archive storage such as Amazon Glacier. The next 1.8 PB increment of Lustre storage is already in the works and will cost less than $100/TB.
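As a rough sanity check on that claim (not the center’s actual accounting), the sketch below compares DCL’s one-time $137/TB against a pay-per-month archive service over five years. It is a back-of-the-envelope calculation only: the $0.01 per GB-month archive price is an assumed figure for illustration, and power, cooling, and administration costs are ignored on both sides.

    # Back-of-the-envelope comparison of DCL's one-time cost per usable TB
    # with a pay-as-you-go archive service over a five-year horizon.
    # Assumptions (illustrative only): archive pricing of $0.01 per GB-month;
    # power, cooling, and administration are ignored on both sides.

    DCL_COST_PER_TB = 137.0            # one-time cost per usable TB (from the article)
    ARCHIVE_PRICE_PER_GB_MONTH = 0.01  # assumed archive list price, $/GB-month
    GB_PER_TB = 1000
    YEARS = 5

    archive_cost_per_tb = ARCHIVE_PRICE_PER_GB_MONTH * GB_PER_TB * 12 * YEARS

    print(f"DCL, one-time cost:    ${DCL_COST_PER_TB:,.0f} per TB")
    print(f"Archive, {YEARS}-year cost: ${archive_cost_per_tb:,.0f} per TB")

Under these assumptions the archive service alone comes to roughly $600 per TB over five years, so even a generous allowance for DCL’s operating costs leaves a wide margin in its favor.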
Despite its low cost, and with minimal tuning, DCL delivers a respectable 55 Gbps of aggregate bandwidth under load. “On a per-terabyte basis, it may be the least expensive and lowest-power parallel file system on the planet,” Pineda says.
Pineda estimates that the system realizes a 30 to 50 percent energy savings compared to typical enterprise systems of the same size. Only systems using helium drives approach the low power consumption of DCL, but at nearly twice the cost per TB. To achieve low power at low cost, DCL uses Western Digital Red drives in Supermicro JBODs.
By developing systems that are tailored to a life science research HPC environment rather than an enterprise environment, Johns Hopkins has allowed its investigators to effectively multiply their hard-won grant funds during a period of flat NIH budgets.
System Details

The JHPCE facility is optimized for the embarrassingly parallel applications that are the bread and butter of genomics, biostatistics, and Big Data analytics. The cluster has more than 2,500 x86_64 cores spanning a variety of architectures and generations, 20 terabytes of DRAM, and more than 2 petabytes of networked mass storage. The networked mass storage is a combination of older enterprise storage appliances (mostly ZFS), the 688 TB NFS-over-ZFS system, and the 1.2 PB Lustre-over-ZFS file system. The network fabric is largely 10 Gbps Ethernet, and the facility is connected via a 40 Gbps link to the university’s science DMZ.
Making faster progress on childhood neurological disorders
Dr. Dani Fallin is one Johns Hopkins Bloomberg School researcher who is reaping the benefits of the center’s DCL storage system. Fallin is Chair of the Department of Mental Health and Director of the Wendy Klag Center for Autism and Developmental Disabilities. She applies genetic epidemiology methods to studies of neuropsychiatric disorders, including autism, Alzheimer’s disease, schizophrenia, and bipolar disorder, and develops applications and methods for epigenetic epidemiology, particularly as applied to mental health and development.
Her research group focuses on childhood neurological disorders, examining molecular changes that happen in utero, at birth, and as a child develops. Researchers gather data not only on the child but also on the mother and other family members, so that they can measure changes to genomic structures over time as they relate to environmental exposure.
The data gathered is voluminous. DNA methylation and RNA sequencing (RNA-seq) data for each sample include up to three billion data points and sometimes double or triple that. Before the new Lustre-based storage system was available, Fallin’s researchers either waited weeks for analyses to run or they chopped data into smaller pieces, ran the pieces in parallel, and then put the results back together. However, this created I/O problems when moving data in and out of storage.
“We’re really excited about these new creative storage and computing tools in JHPCE, because they let us run these data-intensive analyses all at once and get results in hours versus weeks,” Fallin says. “We finally have enough space to hold all this data as our processing moves between large files. We are able to analyze orders of magnitude more data than we were in the past. Without this system, we couldn’t do this research.”
Cost-effective storage for MapReduce on traditional HPC clusters
Abhinav Nellore is a computational biologist in the Departments of Biostatistics and Computer Science at Johns Hopkins University. He and his collaborators, Ben Langmead, Jeff Leek and Chris Wilks, have successfully run their spliced alignment software on the new Lustre system. This software, Rail-RNA, is a MapReduce pipeline that analyzes hundreds of RNA sequencing samples at once, and simply would not run on an NFS-based system.
“In genomics, we often look at large datasets that require a lot of storage and do processing that generates a lot of intermediate data. We need a place to put this intermediate data, and we need it to be fast for many workers in a cluster to simultaneously pull this data, especially for MapReduce jobs,” Nellore says. “We tried running Rail-RNA using an NFS cluster, and the network bottleneck was immediate. We couldn’t analyze more than three or four samples at a time.”
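The pattern Nellore describes is the standard MapReduce one: many map workers write intermediate files to a shared file system, and a reduce step then pulls them all back, so aggregate storage bandwidth rather than compute sets the ceiling. The following is a minimal, hypothetical Python sketch of that access pattern over a shared directory; it is not Rail-RNA’s code, and the shared path and the word-count task are invented purely for illustration.

    # Minimal, hypothetical sketch of the MapReduce-style access pattern described
    # above: map workers write intermediate files to a shared directory (e.g., a
    # Lustre mount), then a reduce step reads them all back and aggregates them.
    # The path and the word-count task are illustrative; this is not Rail-RNA.
    import os
    from collections import Counter
    from multiprocessing import Pool

    SHARED_DIR = "/lustre/scratch/intermediate"  # assumed shared parallel-FS path

    def map_worker(args):
        """Process one input shard and write its counts to shared storage."""
        shard_id, text = args
        counts = Counter(text.split())
        out_path = os.path.join(SHARED_DIR, f"part-{shard_id:05d}.tsv")
        with open(out_path, "w") as f:
            for word, n in counts.items():
                f.write(f"{word}\t{n}\n")
        return out_path

    def reduce_all(paths):
        """Pull every intermediate file back from shared storage and merge it."""
        total = Counter()
        for path in paths:
            with open(path) as f:
                for line in f:
                    word, n = line.rsplit("\t", 1)
                    total[word] += int(n)
        return total

    if __name__ == "__main__":
        os.makedirs(SHARED_DIR, exist_ok=True)
        shards = [(i, f"example input text for shard {i}") for i in range(8)]
        # Many workers hit the shared file system at once; this is where
        # aggregate bandwidth matters.
        with Pool(processes=4) as pool:
            intermediate = pool.map(map_worker, shards)
        print(reduce_all(intermediate).most_common(5))

On a single NFS server, all of those simultaneous writes and reads funnel through one machine, which is the bottleneck Nellore hit; a parallel file system such as Lustre spreads the same traffic across multiple object storage servers.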
When they ran Rail-RNA on DCL, they saw success. “The ZFS/Lustre system made the difference between our code not working and working; it’s that simple,” Nellore says.
This makes Nellore happy, because cost-effective, easy-to-use storage infrastructure lets his team market Rail-RNA to a much broader research audience. Most institutional clusters are run by administrators who are trained in traditional HPC systems and have little experience with Hadoop, so they often aren’t aware of the unique storage needs of genomics researchers. Fortunately, Rail-RNA runs on both Hadoop and non-Hadoop clusters, so a ZFS/Lustre system can be added inexpensively to an existing traditional cluster, accommodating MapReduce pipelines like Rail-RNA without requiring familiarity with Hadoop.
“The ZFS/Lustre system lets us market our software to a broad set of researchers whose institutions have already invested in HPC systems that aren’t suited to the needs of genomics research,” Nellore says. “They can cost-effectively add it to their existing HPC setup, and it doesn’t change things so radically that people don’t understand it.”
Embracing disruptive technologies
Pineda sees history repeating itself. “In the late 1990s, Linux clusters emerged from the convergence of two trends:
- the development of Linux-based open-source software stacks for networked distributed computing and
- the availability of commodity computing hardware enabled by increasingly powerful low-end microprocessors.
“The resulting cluster technology provided a cost-effective alternative to the proprietary SMP systems that dominated the offerings of HPC vendors. Today, we again see the convergence of two trends:
- the availability of Linux-based open-source software stacks for storage at scale — e.g., ZFS and Lustre — and
- the availability of commodity storage hardware and low-end NAS drives.
Converging trends may again lead the HPC community to embrace disruptive technologies,” Pineda says.
But there are barriers. The lower-redundancy, low-power, low-cost design developed by the Johns Hopkins team may not fit everyone’s requirements. Moreover, BioTeam collaborator Ari Berman has observed that DIY storage “is not for the weak of heart.” At present, few research-IT organizations have the expertise to develop their own storage solutions. However, the Johns Hopkins team believes that storage is no harder to develop or administer than Linux clusters; it’s just different. Unfortunately, the requisite knowledge is not widely disseminated.
To educate the research-IT community, Johns Hopkins and BioTeam organized a hands-on workshop to introduce participants to the basics of Lustre-over-ZFS storage systems (see http://www.DisruptiveStorage.org). The hope is to foster a research-IT community that embraces open-source/commodity hardware in life-science research environments.
“Business as usual” in terms of enterprise storage did not satisfy the redundancy, cost and performance requirements of the Johns Hopkins School of Public Health. Lustre-over-ZFS, combined with consumer NAS disk drives, represents an alternative model with the potential to disrupt HPC “business as usual” in research environments.
Jane Glasser is a high-tech writer based in Portland, OR.