Once upon a time, high performance computing (HPC) was mainly about one thing: speed. It started with fast processors, memory and bus fabrics, and optimized software algorithms to take advantage of them. We ran FORTRAN-based computational benchmarks such as LINPACK, which to this day still determines the rankings of the TOP500 list of supercomputers. We soon learned of limiting factors such as heat, power and the pesky speed of light. Seymour Cray realized that “anyone can build a fast CPU. The trick is to build a fast system.” We responded with massively parallel systems made up of lots and lots of very fast components. All was good in the world.
So, what happened to HPC? Well, for many computationally intensive applications, such as simulation, seismic processing and rendering, overall speed is still the name of the game. As long as the funding is there, the race to exascale is alive and well. With systems and algorithms becoming ever more parallel, HPC will continue marching forward the same way it has for the last several decades. Giant monolithic systems will get larger and faster, and, assuming the software architectures can take advantage of them, so will the processing applications.
However, a new branch of HPC is gaining momentum. IDC calls it “High Performance Data Analysis” (HPDA for short). Essentially, it’s the union of big data and HPC. How will these architectures evolve? Let’s start by looking at the data.
Big Data in a Nutshell
By the end of the 20th century, we perfected the methods to store and query large volumes of structured information (e.g. databases). No matter how large the data sets, we learned to index and find specific values almost instantly. Many practical applications we take for granted today would be impossible without these advances in database management. We even witnessed the rise of highly sought-after (and well-paid) professionals known as DBAs (database administrators), whose sole purpose in life was to design systems to organize and manage our information.
Now in the 21st century, the value of data has changed tremendously. While the literal information we query is still important, the hidden value, or insight, we can gain from seemingly unrelated, massive sources of data is where the real innovation lies. In a nutshell, this is big data: a combination of large data sets (as its name implies) and analytics that discover new value in them. In most cases, big data is also “unstructured,” meaning the sources we analyze come in vastly different formats. For example, medical history and social media feeds can be combined to prescribe healthier behavior.
Not surprisingly, the once indispensable DBA is now witnessing the rise of the data scientist. We’ve built systems to deal with very sloppy, unrelated data sets, thus putting the premium on the analytics rather than the organization and indexing. Data scientists develop new techniques and algorithms to deal with the massive, ever more complex volumes of information coming their way at an increasing pace.
Much like HPC did, big data analytics also ran into barriers right away. Remember, the challenge is not just the size of the data sets. How do you organize computers to run algorithms on data that is simply too complex to analyze using traditional methods? Thankfully for big data, cloud computing evolved along with it. Rather than parallelize the processing, as in HPC, big data applications distribute it. The cloud is a stateless beast that makes no guarantee of uptime or reliability, but in exchange delivers seemingly infinite scale. Big data analytics are therefore “designed to fail”: they incorporate distribution, replication and stateless design that works around the lack of ACID guarantees (Atomicity, Consistency, Isolation and Durability) in order to take advantage of that scale. Much like optimizing control flow around loops speeds up algorithms in HPC, big data optimization is all about reducing and processing information in independent, distributed pieces. The cloud (more specifically, cloud architecture, whether public or private) facilitates this.
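To make that concrete, here is a minimal Python sketch of the pattern: each chunk of data is processed independently, with no shared state, and the partial results are combined at the end. The word-count workload, chunk size and use of a local process pool are illustrative assumptions, not any particular framework's API.

```python
# Minimal sketch of processing information in independent, distributed pieces
# (a toy map/reduce). Illustrative only; not a specific big data framework.
from collections import Counter
from multiprocessing import Pool


def map_chunk(lines):
    """Process one independent slice of the data (no shared state)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Combine partial results; order and timing of arrival do not matter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    data = ["big data meets hpc", "hpc meets big data", "data everywhere"] * 1000
    chunks = [data[i:i + 500] for i in range(0, len(data), 500)]

    # Each chunk is self-contained, so any worker (or node) can process it,
    # and a failed chunk can simply be reassigned and recomputed.
    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)

    print(reduce_counts(partials).most_common(3))
```

Because every chunk is self-contained, the same code can be spread across as many stateless workers as the cloud will give you, which is exactly the “designed to fail” property described above.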
How HPC Helps Big Data
The challenge for big data is no longer about scale, but rather speed. We’ve come full circle. The “time value” of information is critical. Imagine computing our way to the cure for an epidemic only after half the population has died. That’s unacceptable, but thankfully avoidable thanks to HPC. Two of the things that make HPC great can also really help process big data faster: computing density and interconnect speed. When you factor in popular accelerators and coprocessors, such as FPGAs and GPGPUs, HPC computing units are much denser than their mass market cousins. Sure, you can find multicore CPUs in just about any cloud infrastructure, but what about the thousands of cores available from each of the multiple supercomputing GPGPUs that accompany every CPU in an HPC architecture? Most cloud infrastructures simply don’t offer that. Equally important, if you’re going to distribute processing, you need to make sure that when nodes do need to communicate, the communication is really fast. Otherwise it becomes a bottleneck, as it does on the high latency Ethernet networks common in cloud platforms. Whether it’s data ingress or periodic synchronization, communication between nodes in a distributed architecture can have a major impact on overall processing speed. Thankfully, we’ve solved this in HPC using InfiniBand and other high performance, low latency interconnects.
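To see why interconnect latency matters so much, here is a rough back-of-envelope model in Python. The latency figures are illustrative assumptions (commodity Ethernet round trips on the order of tens of microseconds, InfiniBand on the order of a microsecond or two), not measurements of any specific fabric.

```python
# Back-of-envelope model of the interconnect bottleneck described above.
# Latency values are assumed, round-number figures for illustration only.

def sync_overhead_seconds(num_syncs, latency_us):
    """Total time spent waiting on inter-node synchronization."""
    return num_syncs * latency_us / 1e6


ITERATIONS = 1_000_000      # a tightly coupled job syncing once per iteration
ETHERNET_US = 50.0          # assumed commodity Ethernet round-trip latency
INFINIBAND_US = 1.5         # assumed InfiniBand round-trip latency

print("Ethernet  :", sync_overhead_seconds(ITERATIONS, ETHERNET_US), "s lost to sync")
print("InfiniBand:", sync_overhead_seconds(ITERATIONS, INFINIBAND_US), "s lost to sync")
# Roughly 50 s versus 1.5 s of pure waiting per million synchronizations,
# before any computation happens -- which is why low latency interconnects
# matter so much at scale.
```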
While we’re seeing new cloud providers offer the staples of HPC on demand, such as GPGPUs and InfiniBand, the challenge of applying this to big data is still in front of us. We know information loses value as time passes, which is why we cannot ignore HPC as a critical component of the evolution of big data analytics. But how do we reconcile the sloppy, unstructured nature of big data with the highly organized parallel architecture of HPC?
HPC in the Cloud
Cloud architectures are the answer to bridging big data and HPC. Remember, not all clouds are publicly accessible, so this doesn’t mean that highly sensitive codes will suddenly become Facebook applications. What it does mean, however, is that their architecture needs to become more distributed, less stateful and more fault tolerant. It is indeed possible to run parallel algorithms on cloud platforms, as long as those algorithms can execute as part of a larger distributed set, each operating on its own “slice” of information. Applied to a big data problem, this means the analytics can be far more complex, taking advantage of multiple parallel nodes, accelerators and ultra-low latency message passing over high performance interconnects. But at a macro level, the distributed nature of big data still applies; the individual computing units are just much denser when you add in HPC. Instead of having “instances” running analytics, we will have distributed arrays of clusters operating on their own independent chunks of data (sometimes dynamically provisioned thanks to the new generation of HPC cloud platforms). We will learn to bring the processing to the data, instead of the other way around, so that we avoid moving petabytes (and soon exabytes) of information just to analyze it. As a result, we will encourage collaboration at massive scale (even among competing organizations).
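As a rough illustration of the “bring the processing to the data” idea, here is a hypothetical Python sketch: a coordinator submits analytics jobs to the clusters already co-located with each data slice and collects only the small partial results. The DATA_SITES registry, submit_to_site() helper and fake result are placeholders, not a real platform API.

```python
# Sketch of dispatching work to where the data lives, rather than moving the
# data to the compute. All names and endpoints below are hypothetical.
from concurrent.futures import ThreadPoolExecutor

# Where each data slice already lives (e.g., the cluster attached to the
# storage holding that slice). Illustrative assumption.
DATA_SITES = {
    "slice-001": "cluster-a.example.com",
    "slice-002": "cluster-b.example.com",
    "slice-003": "cluster-c.example.com",
}


def submit_to_site(site, slice_id):
    """Stand-in for submitting an analytics job to the cluster co-located
    with the data; a real deployment would call a platform job API here."""
    # Return a fake partial result so the sketch runs end to end.
    return {"site": site, "slice": slice_id, "result": hash(slice_id) % 100}


def run_analysis():
    # Each cluster works on its own slice in parallel; only the small partial
    # results travel over the wide-area network, never the raw petabytes.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(submit_to_site, site, slice_id)
                   for slice_id, site in DATA_SITES.items()]
        partials = [f.result() for f in futures]
    return sum(p["result"] for p in partials)


if __name__ == "__main__":
    print("combined result:", run_analysis())
```

The design choice is the same one described above: the heavy, dense computation stays next to the data, and only lightweight results cross the network.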
Cloud architectures will ensure that HPDA is much greater than the sum of its parts (big data analytics and HPC). May we no longer measure the power of our systems in double precision performance, but rather in terrible diseases cured, quality of life improved, national security advanced and customers understood (to name just a few ways HPDA can help the world).
Leo Reiter is a cloud computing pioneer who has been designing, developing, and evangelizing large scale, on demand systems and technologies since the mid-1990s. Currently, Leo serves as Chief Technology Officer of Nimbix, Inc., a global provider of High Performance Computing applications and platforms on demand.