High performance computing (HPC) workflows were originally found primarily in government laboratories and university environments. Large compute clusters would bang away at enormously complex problems submitted by users as batch jobs, with all compute nodes periodically dumping the contents of memory to shared data storage. Since jobs could take weeks to complete, these periodic checkpoints provided protection against node failures: after a compute or networking fault, the job could restart from the last checkpoint rather than from the beginning. When all is going well, this is a write-once, read-never workload until the much smaller results data is generated. The data storage for this type of workload is usually called a “scratch space.” Because the data is temporary and can be regenerated if need be, the emphasis for this type of HPC storage has been on maximizing aggregate large-file throughput rather than on data reliability and availability.
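For readers less familiar with the pattern, here is a minimal sketch of that checkpoint/restart loop. The scratch path, checkpoint interval, and simulate_step function are hypothetical placeholders for illustration, not any particular scheduler's or application's API.

```python
import os
import pickle
import time

CHECKPOINT_PATH = "/scratch/job_1234/checkpoint.pkl"  # hypothetical scratch-space path
CHECKPOINT_INTERVAL = 3600  # seconds between memory dumps (assumed value)

def load_checkpoint():
    """Resume from the last dump on shared scratch storage, if one exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": None}  # fresh start

def save_checkpoint(job_state):
    """Write in-memory state to scratch; written once, read only on restart."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(job_state, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic swap so a crash never leaves a torn file

def run_job(total_steps, simulate_step):
    """Advance the simulation, dumping state periodically so a failure
    costs at most one checkpoint interval of lost work."""
    job_state = load_checkpoint()
    last_dump = time.time()
    while job_state["step"] < total_steps:
        job_state["state"] = simulate_step(job_state["state"])
        job_state["step"] += 1
        if time.time() - last_dump >= CHECKPOINT_INTERVAL:
            save_checkpoint(job_state)
            last_dump = time.time()
    return job_state["state"]  # the much smaller results data
```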
Over the last few years, HPC workflows have become much more prevalent — compute clusters and storage on a scale previously used only by universities and government labs are now commonly deployed by enterprise companies for product design, especially CAE simulation (with software from companies like ANSYS and CD-Adapco), genomic research, seismic processing, and other business-critical use cases. Meanwhile, on the upper end of HPC, government labs have been busy planning for exascale computing (one thousand petaflops in a single supercomputer). They have a long way to go, seeing as the world’s fastest supercomputer on the Top 500 list, Tianhe-2 at China’s National University of Defense Technology, is “only” at 33.86 petaflop/s. It turns out that both of these important transitions are working together to drive the need for increased reliability and resilience in HPC storage, although for different reasons.
Let’s start with the race to exascale. The reason that the pursuit of exascale computing in government HPC implies a strong requirement for more resilience in data storage is clear when you think at that scale. Moore’s Law states that transistor counts in a dense integrated circuit like a CPU roughly double every two years. However, that’s nowhere near fast enough progress to deliver exascale in a few years’ time at existing component counts — as a result, tens of thousands of CPUs (or possibly GPUs) along with vast counts of drives and/or flash devices will be required to build the first exascale supercomputer. Along with vast quantities of components come vast numbers of failures, and the system becomes statistically much more likely to hit worst-case failure modes for data storage and other parts of the system.
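To make that scale argument concrete, a back-of-the-envelope calculation shows how quickly the chance of at least one component failure grows with component count. The MTBF and time-window figures below are assumptions chosen purely for illustration, not measured data, and the model assumes independent exponential failure times.

```python
import math

def p_any_failure(component_count, mtbf_hours, window_hours):
    """Probability that at least one of N identical components fails
    within the window, assuming independent exponential failure times."""
    p_single = 1.0 - math.exp(-window_hours / mtbf_hours)
    return 1.0 - (1.0 - p_single) ** component_count

# Illustrative numbers only: a 1.2-million-hour MTBF drive, 24-hour window.
for drives in (100, 10_000, 1_000_000):
    print(drives, round(p_any_failure(drives, 1_200_000, 24), 4))
# 100       -> ~0.002  (failures are rare events)
# 10,000    -> ~0.18   (a failure every few days)
# 1,000,000 -> ~1.0    (a failure somewhere is a near certainty)
```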
What about enterprise use of HPC workflows (sometimes also called technical computing)? Data reliability and availability are perhaps even more important in the enterprise. Unlike a scratch space model where most data can be recreated if need be, data in the enterprise HPC case is often reused and accessed frequently over time. As a result, source and output data need to be protected on an ongoing basis. The more technical computing becomes a core part of creating business value, the more important the data protection becomes.
When it comes to storage solutions available today, look for systems that protect data using erasure codes in software with a de-clustered RAID approach. Hardware RAID can still provide sufficient protection for storage in the hundreds of terabytes, but it becomes less and less reliable at scale. With the right implementation, data reliability can actually increase with scale using erasure codes due to the randomized distribution of files across elements in the storage system. Furthermore, RAID rebuild speeds need to scale linearly with the size of the system, so that rebuild times stay short and the window of vulnerability to additional failures shrinks rather than grows. Other important considerations include the ability of the storage system to detect and correct sector errors, approaches that reduce the scope of rebuilds to only the data affected, I/O protocol fail-over, and power protection to safeguard cached data that is in the process of being written.
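As a rough sketch of why de-clustered rebuilds shrink that window of vulnerability, the toy calculation below compares a traditional rebuild, where a single spare drive absorbs all the writes, with a de-clustered layout where many surviving drives each contribute a share of the work. The drive capacity and per-drive rebuild bandwidth are assumed figures for illustration only.

```python
def rebuild_hours(drive_tb, per_drive_rebuild_mbps, writers):
    """Hours to reconstruct one failed drive's worth of data when
    `writers` drives share the rebuild work in parallel."""
    total_mb = drive_tb * 1_000_000
    return total_mb / (per_drive_rebuild_mbps * writers) / 3600

# Assumed figures: 8 TB drives, 100 MB/s of rebuild bandwidth per drive.
print(rebuild_hours(8, 100, writers=1))    # classic RAID spare: ~22 hours exposed
print(rebuild_hours(8, 100, writers=100))  # de-clustered across 100 drives: ~13 minutes
```

Because the number of drives that can participate in a rebuild grows with the size of a de-clustered system, rebuild speed scales with the system rather than being capped by a single spare.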
Today, open source parallel file systems like Lustre are not well set up to handle many of these requirements, because data protection is not built into the file system and instead depends on hardware- or software-based block-level RAID underneath. However, proprietary parallel file systems exist that deliver the higher levels of data protection required for enterprise and large-scale public sector deployments. As the storage industry continues to innovate, expect erasure coding for data protection to become the standard across the industry, as it’s clearly superior to hardware RAID in pretty much every way that matters.
Geoffrey Noer is Vice President of Product Management at Panasas.