The HPC and enterprise communities are experiencing a paradigm shift as FLOPS per watt, rather than raw FLOPS (floating-point operations per second), becomes the guiding metric in procurements, system design, and now application development. In short, “performance at any cost” is no longer viable, as the operational costs of supercomputer clusters are now on par with the acquisition cost of the hardware itself. Energy cost is the reason why modern data centers and supercomputer centers are being located, whenever possible, where energy is inexpensive and environmentally friendly (e.g., hydroelectric). As Jeff Goodell observed, “coal was supposed to be the engine of the industrial revolution, not the Internet revolution.” Regardless of motive, whether practicality, altruism or profit, procurement teams, system designers and application developers must learn to optimize the energy efficiency of their hardware and application software.
The January 2015 ACM Computing Surveys paper, “A Survey of Methods for Analyzing and Improving GPU Energy Efficiency,” by Sparsh Mittal and HPC luminary Jeffrey Vetter, found a general hierarchy in the literature of device energy efficiency relative to application performance: FPGAs are the most efficient, followed by GPUs, with CPUs the least efficient. The authors also made the general observation that “a single platform cannot be accepted as most-energy-efficient for all possible applications.”
The challenge with this hierarchy is that FPGAs are considered to be extremely hard to program, while CPUs and GPUs are among the easiest. Recent advances in OpenACC and OpenMP4 compilers by PGI, PathScale, and the soon-to-be-released GCC 5.0 are helping to blur the distinction, and the difference in effort, between programming a CPU and a GPU, while FPGAs can be programmed in OpenCL.
Even with the ability to write OpenCL programs for FPGAs, programming these devices is not for the faint of heart. A long-standing joke in the technical community is that people can be found in the hallways yelling and screaming when a conventional system crashes, while people can be found in the hallways yelling and screaming when an FPGA application works.
Power efficiency is the reason why organizations like Microsoft’s Bing are investing in FPGA-based codes and why Mittal and Vetter suggest the possibility of including FPGAs in future HPC systems. Microsoft Research recently touted a conservative estimate that an OpenCL-programmed implementation of convolutional neural networks (CNNs) achieves a 3x performance-per-watt advantage over NVIDIA GPUs when running on new FPGA hardware. CNNs are now used throughout the industry for image recognition, image classification, self-driving cars and many other applications. For Bing, reducing power consumption from 235 watts on a Tesla K40 to 25 watts on an FPGA for continuously running CNN jobs can be a big energy saver, which translates directly into money saved and greenhouse gas emissions eliminated.
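To put the quoted figures in perspective, the following is a minimal back-of-the-envelope sketch, not a measurement. It assumes that both devices draw their quoted power continuously all year and that the 3x performance-per-watt figure and the 235-watt and 25-watt figures describe the same comparison. The function and variable names are illustrative only.

```python
# Figures quoted in the text above.
GPU_WATTS = 235.0               # Tesla K40
FPGA_WATTS = 25.0               # FPGA board
PERF_PER_WATT_ADVANTAGE = 3.0   # FPGA vs. GPU, per the Microsoft Research estimate

HOURS_PER_YEAR = 24 * 365       # assumption: continuous operation, non-leap year


def annual_energy_kwh(power_watts, hours=HOURS_PER_YEAR):
    """Energy in kWh drawn by a device running at a constant power."""
    return power_watts * hours / 1000.0


# Naive per-device comparison: swap one GPU for one FPGA.
per_device_saving_kwh = annual_energy_kwh(GPU_WATTS) - annual_energy_kwh(FPGA_WATTS)

# Throughput of one FPGA relative to one GPU implied by the quoted figures
# (assumption: the 3x ratio and the wattages describe the same comparison).
relative_throughput = PERF_PER_WATT_ADVANTAGE * FPGA_WATTS / GPU_WATTS

# FPGA power needed to match one GPU's throughput, and the resulting saving.
fpga_watts_equal_work = FPGA_WATTS / relative_throughput
equal_work_saving_kwh = (
    annual_energy_kwh(GPU_WATTS) - annual_energy_kwh(fpga_watts_equal_work)
)

print(f"Per-device saving:   {per_device_saving_kwh:.1f} kWh/year")
print(f"FPGA/GPU throughput: {relative_throughput:.2f}")
print(f"Equal-work saving:   {equal_work_saving_kwh:.1f} kWh/year")
```

Under these assumptions, one FPGA delivers roughly a third of the throughput of one GPU, so the saving for the same amount of work is smaller than the per-device numbers suggest. The sketch counts device power only and ignores the host system, cooling and idle time.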
For HPC, Mittal and Vetter have taken the position in their paper that further research might include the use of FPGAs in future HPC and exascale systems.
The technology world is rapidly evolving. For example, the NVIDIA X1 SoC (system on a chip) can deliver a teraflop of floating-point performance in a 10 watt TDP package, potentially bringing the power efficiency of CNNs to the FPGA level. However, the X1 is so new that no performance numbers on CNNs or other neural networks have been published as yet.
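The efficiency implied by those two numbers is simple arithmetic, sketched below. It treats the quoted teraflop as a peak rate and the 10 watt TDP as the power draw, so the result is an upper bound rather than a measured application figure. The helper names and the example workload size are hypothetical.

```python
def gflops_per_watt(peak_flops_per_second, power_watts):
    """Peak efficiency in GFLOP/s per watt."""
    return peak_flops_per_second / power_watts / 1e9


def joules_for_work(total_flops, peak_flops_per_second, power_watts):
    """Energy to finish total_flops, assuming the device sustains its peak rate."""
    seconds = total_flops / peak_flops_per_second
    return seconds * power_watts


# Figures quoted in the text: one teraflop per second in a 10 watt TDP package.
X1_PEAK_FLOPS = 1e12
X1_TDP_WATTS = 10.0

x1_efficiency = gflops_per_watt(X1_PEAK_FLOPS, X1_TDP_WATTS)

# Hypothetical workload of 10^15 floating-point operations.
x1_energy_joules = joules_for_work(1e15, X1_PEAK_FLOPS, X1_TDP_WATTS)

print(f"{x1_efficiency:.0f} GFLOP/s per watt")
print(f"{x1_energy_joules:.0f} J for 1e15 operations at peak")
```

Real applications rarely sustain peak throughput, so measured energy to solution is the number that matters once benchmarks are published.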
To account for innovation, Mittal and Vetter made the following three caveats in their paper:
- Stacked 3-D memory can be a game-changer for GPU energy efficiency. Specifically they wrote, “It is expected that leveraging the benefits of 3-D stacking and NVM would be a major step in improving the energy efficiency of GPUs.”
- The efficiency results crucially depend on the devices and evaluation methodology used in the experiments. The challenge is that researchers need to spend the time to fully optimize both the CPU and GPU codes: “researchers who apply careful optimization on both CPUs and GPUs have reported much lower speedups of GPUs over CPUs, typically in the range of 0.7× to 15×.”
- The use of virtualization technology may greatly increase utilization and reduce GPU idle time energy consumption.
In particular, stacked 3-D memory has the potential to redefine computing performance and efficiency at all levels, from hardware to software and algorithm design. The new Intel Knights Landing processors will include on-package HBM (high-bandwidth memory) that will offer 5x the bandwidth and power efficiency of DDR4 memory. Similarly, the NVIDIA Volta architecture will include stacked memory. The significantly higher bandwidth and greater energy efficiency of these new memory systems will remove two big performance and efficiency obstacles for both enterprise and HPC applications. The potential upside is very exciting.
For most scientific and enterprise applications, FPGAs are interesting, but they cannot compete with the general programmability of CPUs and GPUs. Even so, enterprise customers may very well find the efficiency benefits of FPGAs compelling for frequently run applications where the power savings justify funding the software development.
Still, the breadth of applications for which the scientific and technical literature surveyed by Mittal and Vetter reports better energy efficiency from FPGAs is eye-opening. Following is a brief list from the paper:
- BLAS, 2-D FIR (finite-impulse response) filters
- Viola-Jones face detection
- K-means clustering
- Monte-Carlo options pricing
- random number generation
- some aspects of N-body calculations
- 3-D ultrasound computer tomography
- the Smith-Waterman algorithm
It is worth noting that Mittal and Vetter reported GPUs as the clear winner over both FPGAs and CPUs when running FFTs (fast Fourier transforms). FFTs are ubiquitous throughout scientific computing.
New hardware innovations will likely be the dominant consideration for future energy-efficient HPC designs and procurements. For example, the Green 500 called the new Intel Knights Landing (KNL) chip “the most power-efficient parallel processor in the world.” The Knights Landing processor provides both out-of-order execution via the Intel Atom Silvermont processing cores and the ability to run self-hosted as an SMP processor that does not require offload programming. NVIDIA has been quiet lately, but we can see the potential benefits in the 100 GFLOP/s per watt of the Tegra X1.
In terms of programmability, look to the potential of OpenACC and OpenMP4, as programs written using these open specifications can be compiled to run on NVIDIA GPUs, AMD GPUs, ARMv8, x86, Intel Xeon Phi and potentially other architectures, such as Power 8.
To keep a sense of perspective on energy efficiency, we should all be humbled that the human brain consumes only about 12 watts of power. Watson, the IBM system that played Jeopardy, utilized 750 servers, each of which consumed 1 kW of power, or roughly 750 kW total during the show. That is roughly 62,500 times the power draw of the brain.
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at [email protected].