As ubiquitous as the term “big data” has become, the path to drawing real, actionable insights from it has not always been clear. And the need is only growing as organizations generate ever larger volumes of structured and unstructured data. While data-intensive computing is not new to high performance computing (HPC) environments, newer analytic frameworks, including Hadoop, are emerging as viable compasses for navigating these vast, complex data sets. Applied correctly, these frameworks can support critical decision making while saving time and reducing storage expenses.
Today, price/performance and scalability are the factors that matter most to HPC implementations. Market pressures are driving demand for extremely efficient compute and storage architectures, especially given the tremendous growth expected in HPC environments and the corresponding need to scale to petabytes and even exabytes of data in the near future. Storing this data is one thing; deriving value and insight from it is another.
The answer can be found in newer analytic frameworks like Hadoop, which are beginning to drive strong growth in high performance data analysis (HPDA). HPC applications increasingly require analytics-based algorithms for discovering new insights alongside modeling, simulation and other techniques, and the Hadoop framework is enabling these applications to extract value from vast amounts of data across a wide range of industries and use cases.
For example, improving clinical treatment outcomes for chronic diseases, such as various types of cancer, requires predictive analytics performed on very large data sets. As the demand for advanced predictive analytic methods and tools grows, Hadoop’s computing approach can be valuable when applied in conjunction with HPC applications, where massive amounts of cross-disciplinary data must be aggregated into a synthesized result.
An open analytic framework is only half the solution, however. Many Hadoop clusters are still constrained by their default storage layer, the Hadoop Distributed File System (HDFS), whose distributed architecture carries significant performance implications for HPC environments: HDFS physically allocates data to the local storage of individual compute nodes. In contrast, a growing number of data-intensive commercial and institutional organizations are moving to storage platforms built on low-latency, high-performance, POSIX-compliant parallel file systems such as Lustre.
Parallel file systems like Lustre provide high-speed data access shared equally among thousands of compute nodes. Because Hadoop with HDFS does not support standardized POSIX data sharing by design, the vast majority of organizations with existing HPC applications must find another way to drive insights from their data. As a POSIX-compliant file system, Lustre can support both Hadoop and HPC application environments, and Hadoop on Lustre is emerging as a viable pathway for achieving this objective.
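To make the compatibility point concrete, the sketch below assumes a Lustre file system mounted at a hypothetical path such as /mnt/lustre. Because the mount behaves like any other POSIX file system, an existing HPC-side application can write its results there with ordinary file I/O and no Hadoop-specific API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal sketch: write a result file to a POSIX-mounted Lustre file system.
// The mount point and file names are hypothetical examples, not part of any
// particular product or deployment.
public class WriteToLustre {
    public static void main(String[] args) throws IOException {
        Path results = Paths.get("/mnt/lustre/experiments/run-001/results.csv");
        Files.createDirectories(results.getParent());
        // Ordinary file I/O; nothing here is specific to Hadoop or Lustre.
        Files.write(results, "sample_id,score\nA17,0.93\n".getBytes(StandardCharsets.UTF_8));
    }
}
```

Any tool on the cluster that can see the mount, including a Hadoop job, can then analyze that file in place.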
When implemented, Hadoop on Lustre bypasses HDFS, directing high-throughput I/O to the Lustre platform and allowing Hadoop and HPC Lustre clusters to work on exactly the same data without moving it between file systems or storage devices. The result is a pathway that marries Lustre’s efficiency and performance with Hadoop’s powerful data analytics tools, including the MapReduce framework and other software packages in the Hadoop ecosystem, while still providing the scale and performance today’s HPC environments require. And because this approach integrates disparate compute operations and improves overall performance, it also speeds time to results and improves total cost of ownership, among other benefits.
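As an illustration of the Hadoop side, the sketch below shows a minimal, hypothetical word-count MapReduce job configured to read input from and write output to that same Lustre mount. A production deployment may rely on a dedicated Lustre adapter for Hadoop; here, for simplicity, Hadoop’s built-in local (POSIX) file system is pointed at the shared mount via fs.defaultFS, and the path names are assumptions rather than part of any particular product:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Minimal sketch: run a standard Hadoop MapReduce job directly against a
// POSIX-mounted Lustre path instead of HDFS. Paths are hypothetical.
public class LustreWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use Hadoop's built-in local/POSIX file system over the shared mount
        // rather than HDFS; a real deployment might substitute a Lustre adapter.
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "wordcount-on-lustre");
        job.setJarByClass(LustreWordCount.class);
        job.setMapperClass(TokenCounterMapper.class);   // emits (token, 1)
        job.setReducerClass(IntSumReducer.class);        // sums counts per token
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live directly on the Lustre mount; no copy into HDFS.
        FileInputFormat.addInputPath(job, new Path("/mnt/lustre/experiments/run-001"));
        FileOutputFormat.setOutputPath(job, new Path("/mnt/lustre/experiments/run-001-counts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the input already lives on Lustre, there is no ingest step into HDFS, and the job’s output lands on the same mount, where HPC applications and other tools can read it immediately.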
Data-intensive HPC environments are commonplace, but maximizing the value of data is still a rapidly evolving discipline, and Hadoop on Lustre is proving to be a viable vehicle. Investing in an open analytic framework in conjunction with a flexible, scalable storage file system can pay dividends, saving time and reducing storage expenses while enabling organizations to get the most out of their data.
Mike Hoard is Senior Staff, Product Marketing, at Seagate Cloud Systems and Solutions.