My position at Purdue University involves designing high-end research computing systems, consulting with researchers on their computational needs, and leading the teams that implement and operate these systems for the University.
Three of the University's HPC systems are currently listed in the TOP500 ranking of the world's most powerful supercomputers. The University also operates BoilerGrid, the United States' largest academic distributed computing grid, as well as the largest collection of science and medical online hubs. Our researchers use these HPC machines to store and analyze massive amounts of data.
To ensure the highest levels of support for Purdue's research initiatives (including an augmented reality system aimed at saving lives during battle, brain trauma research to develop new tools for early detection and potential treatment, and the search for dark matter in the universe, to name just a few), our IT group has created and deployed a robust data repository called the Data Depot. The Data Depot is a fast, reliable, high-capacity data storage service operated for Purdue researchers across all fields of study; it is a non-purged space ideally suited for data sharing, file editing, and developing and building software.
Our IT team must ensure uninterrupted access to centralized compute, network and storage resources. Each research project requires ubiquitous access to high-speed storage, which necessitates a large, site-wide file system accessible from a variety of high performance computing (HPC) systems and networks. During the design phase, we examined use cases from several top research areas, including computational nanotechnology, aeronautical and astronautical engineering, mechanical engineering, genomics and structural biology, as well as several large projects in the life sciences. The challenge in meeting such varied research needs is accommodating both very large parallel I/O jobs and millions of small, random read requests without imposing performance penalties on any research group. We needed high-performance storage that could address data volume, velocity and variety.
Any storage solution we implemented needed to be flexible enough to accommodate a range of technical requirements and access methods for a variety of data without sacrificing performance. Supporting one particular access pattern could not be allowed to impose a burden on all the other users; our technology and connectivity needs run the full gamut, and we had to sustain the highest levels of performance for all researchers, regardless of project size or demands.
Following a thorough analysis and detailed procurement process, we determined that DataDirect Networks' Storage Fusion Architecture (SFA) platform best met the university's big data needs for performance, scalability, density and reliability. We implemented a pair of DDN SFA 12KX high-performance storage systems with a total of 6.4 PB of raw capacity to provide block storage for the Depot's GPFS parallel file system. Redundant storage arrays were installed in multiple campus data centers for maximum availability. We also took advantage of DDN's Storage Fusion Xcelerator (SFX) technology, which employs advanced caching. The most appealing aspect of SFX is that it offloads much of the work from the underlying rotational storage: read requests for millions of small files can be served from a cache that automatically populates itself with the most frequently accessed data. This lets our IT staff meet researchers' performance needs without the staffing overhead of more manual data-placement schemes.
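To make the idea concrete, here is a minimal Python sketch of that kind of frequency-based read caching. It is purely illustrative and not DDN's implementation; the class name, file names, and capacity are invented for the example.

```python
# Conceptual sketch (not DDN's implementation): a read cache that keeps the
# most frequently requested small files on a fast tier and falls back to
# slower bulk storage on a miss, similar in spirit to serving hot data from
# SSD cache in front of rotational block storage.

from collections import Counter

class ReadCache:
    def __init__(self, capacity, backing_store):
        self.capacity = capacity        # number of files the fast tier can hold
        self.backing = backing_store    # dict: filename -> bytes (stands in for rotational disk)
        self.cache = {}                 # fast-tier contents
        self.freq = Counter()           # access counts used to decide what stays hot
        self.hits = 0
        self.misses = 0

    def read(self, name):
        self.freq[name] += 1
        if name in self.cache:          # served from the fast tier
            self.hits += 1
            return self.cache[name]
        self.misses += 1
        data = self.backing[name]       # slow path: bulk storage
        self._admit(name, data)
        return data

    def _admit(self, name, data):
        # Evict the least frequently used entry when the fast tier is full.
        if len(self.cache) >= self.capacity:
            coldest = min(self.cache, key=lambda f: self.freq[f])
            del self.cache[coldest]
        self.cache[name] = data


# Tiny demonstration: repeated reads of a small hot working set are served
# from cache after the first access.
store = {f"file{i}": b"x" * 4096 for i in range(100)}
cache = ReadCache(capacity=10, backing_store=store)
for _ in range(50):
    for name in ("file1", "file2", "file3"):
        cache.read(name)
print(f"hit rate: {cache.hits / (cache.hits + cache.misses):.2%}")
```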
With our DDN solution, we have achieved a 900 percent improvement in read capability at low cost, serving millions of small files from dedicated solid-state modules even as very large data files stream simultaneously. Certain data queries that used to take two minutes now take two seconds. This performance boost comes from serving nine out of every 10 random read I/O requests from the SFX cache. In other words, 90 percent of the Depot's I/Os are handled by eight solid-state drives, rather than requiring more than 20 additional pools of rotational drives to achieve similar performance. Without this system, we would have to keep adding spinning disks just to keep up with the I/O demands of small files and random-read applications.
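A quick back-of-the-envelope calculation shows why a 90 percent cache hit rate translates into roughly an order-of-magnitude gain. The latency figures below are assumed, illustrative values, not measurements from the Depot.

```python
# Illustrative arithmetic only: with 90 percent of random reads served from
# fast cache and SSD service times far below rotational-disk seek times,
# average random-read latency drops by roughly an order of magnitude,
# consistent with a ~900 percent improvement in read capability.

hit_rate = 0.90          # fraction of random reads served from the SFX cache
ssd_latency_ms = 0.1     # assumed SSD service time per small read
hdd_latency_ms = 10.0    # assumed rotational-disk service time per small read

avg_without_cache = hdd_latency_ms
avg_with_cache = hit_rate * ssd_latency_ms + (1 - hit_rate) * hdd_latency_ms

print(f"average latency without cache: {avg_without_cache:.2f} ms")
print(f"average latency with cache:    {avg_with_cache:.2f} ms")
print(f"speedup: {avg_without_cache / avg_with_cache:.1f}x")
```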
Moving forward, we plan to maximize our use of DDN's SFX technology by leveraging the software interface tools to pre-stage our data. If we know of a particularly intense workload where researchers expect random access to data files from various cluster nodes, we may be able to pre-stage those files into SFX and deliver fast response times.
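Conceptually, pre-staging amounts to touching the expected working set before a job starts so the hot data is already resident in cache when the random-access workload begins. The sketch below illustrates the idea with ordinary file reads; the real workflow would go through DDN's SFX interface tools, and the paths shown are hypothetical.

```python
# Hypothetical sketch of the pre-staging idea: read every block of each
# expected input file ahead of the job so a read cache can populate itself
# with that data. The paths in the usage example are made up.

def prestage(paths, block_size=1024 * 1024):
    """Touch every block of each file so a read cache can warm up on it."""
    for path in paths:
        with open(path, "rb") as f:
            while f.read(block_size):
                pass

# Example: warm the expected working set before the cluster job launches.
# job_inputs = ["/depot/mylab/inputs/sample_001.dat",
#               "/depot/mylab/inputs/sample_002.dat"]
# prestage(job_inputs)
```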
Currently, about 90 percent of the heavy research computing users on campus utilize the Data Depot, with growth occurring across both traditional and new research areas. We have researchers sharing large video files with colleagues across campus and around the world. With our scalable storage platform, we can meet not only the needs of HPC users, but also those of groups that require high-speed data transfers or large-scale data storage.
Mike Shuey is a Research Infrastructure Architect at Purdue University.