Data drives the digital world. Much has been written about the pervasiveness of technology and the promise of big data. We’ve all heard the mantra.
All speculation aside, the growing volume of data is a fact, and one that can’t be ignored. International Data Corporation (IDC) estimates the amount of data in the world will reach 163 zettabytes (163 trillion gigabytes) by 2025. Every industry—from transportation to manufacturing, healthcare to consumer products, and financial services to research and development—is looking for new ways to harness and use the growing volume of data.
Scientists understand that data is the fuel that powers insight, discovery, and innovation. The Institute of Cancer Research (ICR), for example, says big data analytics plays an important role in the discovery of cancer drugs. Scientists are analyzing vast amounts of data—from patient samples, genomic sequencing, medical images, lab results, experimental data, pharmacological data, and many other sources—to help in their efforts.
According to Dr. Bissan Al-Lazikani, head of data science at ICR, more data is better. “The more data we are gathering,” says Dr. Al-Lazikani, “the more patients we are profiling, the smarter the computer algorithms: the better we are becoming at discovering drugs for cancer.”
Large-scale growth
According to Illumina, a leading manufacturer of DNA sequencing solutions, it cost $300,000 to sequence a human genome in 2006. Today, with their high-end sequencers, the cost has dropped to $1,000, and with their new generation of machines it could eventually fall to as little as $100.
As genomic sequencing has become faster and more affordable, researchers are running more sequencing operations and generating more data. The Swiss Institute of Bioinformatics (SIB), for example, is a non-profit founded in 1998 and one of Switzerland’s leading research organizations. Made up of 60 bioinformatics research and services groups and approximately 700 scientists from Swiss schools of higher education and research institutes, SIB is a leader in applying computational methodologies and large-scale data analysis to genomic, proteomic, and other bioinformatics research.
SIB supports projects from about 300 active research teams at their six sequencing centers, handling roughly five separate projects in a given week. Data grows rapidly, with sequencing runs generating up to 30 terabytes a week.
In another example, GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen), a computing center shared by the University of Göttingen and the Max Planck Society, has seen data volumes steadily grow over the years. Today, the center supports some 40,000 users engaged in research and training, manages billions of files, and stewards about 7 petabytes of data.
For research organizations, the ability to collect and analyze more data is essential to making breakthrough discoveries. But handling more data brings its own challenges.
Operating at petabyte levels
Data is not static. It has a lifecycle; it grows and ages, and it must be managed. Once data is created, it must be stored, accessed for computational analysis and collaboration, archived for future use, and protected at every step against the risk of loss. As the amount of scientific data at research institutions grows, these tasks become more difficult.
High performance is important in research. Faster computing power means more data can be analyzed in less time, which can accelerate the research process. Storage infrastructure plays a significant role in the performance of computing environments. High performance requires an infrastructure capable of fast I/O operations without bottlenecks. When storage capacity reaches the multi-petabyte level, maintaining high-performance access becomes a challenge.
Collaboration is another essential factor at research institutions. Technology has made it possible for hundreds of scientists to work together on projects and share information. But scientists may not use the same client operating systems or work in the same locations. Some may use Linux while others use macOS or Microsoft Windows. Some may connect to the storage infrastructure via a SAN while others connect over the LAN via NAS protocols. Sharing data files and research results requires a storage infrastructure that supports not only simultaneous access to files but also multiple access methods and different operating systems.
As storage size grows, data backup procedures must change. When data reaches the petabyte level, traditional backup operations are no longer able to handle the volume. Still, data must be protected against hardware failures. Installing secondary storage arrays for data replication is one way to back up data, but it can be an expensive solution.
Multiple storage tiers
To build a storage infrastructure capable of handling the growing volume of scientific data, research institutions must find ways to blend different storage technologies. High-performance storage, such as flash or high-speed disk, is needed to meet high-performance computing requirements. But at any given time, only a subset of data is active and needs to reside on high-performance media. Storing inactive files on the same media is unnecessary and expensive.
A better approach is to implement multiple tiers of storage. In a multi-tier environment, total storage capacity is spread across different forms of media. High-performance disk or flash storage holds active files—those that are part of an active project or are undergoing computational analysis. The remainder of the capacity consists of lower-cost tape or cloud storage.
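As a rough illustration of the idea, the following Python sketch classifies files by how recently they were used; the 90-day activity window and directory path are assumptions for the example, not a description of any particular product’s policy.

import os
import time

# Illustrative only: the threshold and path below are assumptions, not product behavior.
ACTIVE_WINDOW_DAYS = 90           # files touched within this window stay on fast storage
DATA_ROOT = "/research/projects"  # hypothetical top-level data directory

def classify_by_activity(root, window_days=ACTIVE_WINDOW_DAYS):
    """Split files into 'fast' (recently used) and 'archive' (inactive) candidates."""
    cutoff = time.time() - window_days * 86400
    tiers = {"fast": [], "archive": []}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_used = os.stat(path).st_atime  # last access time
            except OSError:
                continue  # skip files that vanish or are unreadable
            tiers["fast" if last_used >= cutoff else "archive"].append(path)
    return tiers

if __name__ == "__main__":
    tiers = classify_by_activity(DATA_ROOT)
    print(f"fast-tier candidates:    {len(tiers['fast'])}")
    print(f"archive-tier candidates: {len(tiers['archive'])}")

Tiering software applies rules like this continuously and at far larger scale, but the underlying question per file is the same: has it been used recently enough to justify fast media?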
Some research institutions have successfully realized this approach. GWDG, for example, uses a multi-tier storage infrastructure. Of the 7 petabytes of data managed by the organization, only 2.5 petabytes reside on disk. The remaining 4.5 petabytes are stored on tape. Since tape storage is more economical than disk, this approach allows GWDG to deliver the performance and capacity needed at a lower cost to the organization.
Data management in a multi-tier storage environment
The process of data management is the key to getting the most benefit from a multi-tier storage environment. As previously mentioned, data has a lifecycle. On average, about 70-80 percent of stored data files are not actively used. As files age or become inactive, they should be moved off higher-priced storage and archived on lower-cost media.
With a complex storage environment, data management can be cumbersome. Fortunately, data management processes can be automated. Policies can be established at the file level, and the movement of files into archive can be done without the researcher even being aware of it. Managed this way, data files look the same from the researcher’s perspective regardless of where they are stored. As a result, files remain visible and accessible when they are needed.
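To make the idea concrete, here is a minimal sketch of an age-based archive policy, assuming a hypothetical fast-tier path, archive path, and 180-day idle threshold. Real tiering or HSM software performs this transparently, typically with stub files and automatic recall, rather than with symlinks as in this toy example.

import os
import shutil
import time

# Minimal sketch of an age-based archive policy. The paths and the 180-day idle
# threshold are illustrative assumptions; real tiering/HSM software does this
# transparently, typically leaving stub files and recalling data on access.
FAST_TIER = "/fast/projects"        # hypothetical high-performance storage
ARCHIVE_TIER = "/archive/projects"  # hypothetical low-cost archive target
MAX_IDLE_DAYS = 180

def migrate_inactive_files(src_root=FAST_TIER, dst_root=ARCHIVE_TIER,
                           max_idle_days=MAX_IDLE_DAYS):
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or os.stat(src).st_atime >= cutoff:
                continue  # skip already-migrated links and recently used files
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)  # relocate the data to the archive tier
            os.symlink(dst, src)   # keep the original path visible to researchers

In production this kind of policy lives inside the storage or data-management software itself rather than in a script, but the flow is the same: evaluate a rule per file, relocate the data, and keep the original namespace intact so researchers see nothing change.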
Data management in a multi-tier storage environment also helps ensure data is protected. Because multiple tiers are available, policies can be established so that critical data sets are copied to another disk array or to another form of media, such as tape or cloud. That way, data is preserved and can be restored quickly after a hardware failure without disrupting the research process.
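A protection policy can be sketched the same way. The example below copies a hypothetical critical dataset to a second location and verifies each file with a checksum; it is a stand-in for the replication that storage or data-management software would normally perform, and both paths are assumptions.

import hashlib
import shutil
from pathlib import Path

# Illustrative protection copy: replicate a critical dataset to a second location
# and verify each file by checksum. Both paths are assumptions for the example.
SOURCE = Path("/fast/projects/critical_dataset")
REPLICA = Path("/replica/critical_dataset")  # second array, tape gateway, or cloud mount

def sha256(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def protect(src=SOURCE, dst=REPLICA):
    """Copy every file under src to dst, preserving layout, and verify the copies."""
    for src_file in src.rglob("*"):
        if not src_file.is_file():
            continue
        dst_file = dst / src_file.relative_to(src)
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy data and metadata
        if sha256(src_file) != sha256(dst_file):
            raise RuntimeError(f"verification failed for {src_file}")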
Keeping data usable in a high-growth environment
More scientific data is helping researchers make new discoveries. But as more data is generated and storage environments grow larger, research institutions must pay attention to how they manage that growth so their infrastructure delivers the best possible performance in the most economical way.
About the Author
Mark Pastor is director of data intelligence solutions at Quantum. He is responsible for driving Quantum’s data intelligence and storage solutions for high performance computing, AI, research and other large unstructured data environments. Mark also represents Quantum within the Active Archive Alliance and in the LTO Consortium.