Special Chips for Specific Purposes: A look at acceleration technologies for scientific computing
Rediscovering the advantages of the hybrid approach
The analogy between hybrid cars and hybrid computing is almost perfect: combine two kinds of engines and use each in the situations where it performs best. The result is energy savings…or higher performance for the same energy cost. For both cars and computers, energy efficiency is becoming the primary issue. For hybrid computing, the two kinds of engines are a general-purpose processor (like those made by Intel and AMD) and a coprocessor specifically crafted for certain workloads. The workloads might be network traffic handling, graphics display, or high-performance, high-accuracy scientific calculations.
The idea of adding special-purpose processors to accelerate general-purpose hosts is generating considerable interest and excitement in scientific computing. The key is to understand what each accelerator technology can and cannot do. Properly applied, accelerators may get us to petascale scientific computing much sooner than massive collections of unaccelerated general-purpose servers can.
Visible and invisible coprocessors
Computers have long used coprocessors that we do not think about, because they function automatically in dedicated roles. Think of a disk drive controller, or the processor that controls a printer, or even the one that manages the keyboard. Graphics cards handle the task of turning geometric primitives into screen images; they have more processing power (and more transistors) than the largest mainframes of the 1970s, yet we do not think of them as coprocessors. These are invisible coprocessors.
Accelerators for scientific computing are visible coprocessors, at least for now. If you use a field-programmable gate array (FPGA) or a general-purpose graphics processing unit (GPGPU) to handle part of a technical computing job, you will be using a programming environment to specify what it does and how it coordinates with its host processor. As software environments mature, scientific coprocessors may become more like the invisible kind by intercepting dynamically linked library calls to math-intensive functions and just accelerating them. This approach is already working well for some technical applications.

Since coprocessors for scientific applications are still visible to the application programmer, the issue is to fit the right hardware to the job. All accelerators are good…for their intended purpose.
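To make the interception idea concrete, here is a minimal sketch, assuming a Linux system in which the application reaches the standard BLAS routine dgemm_ through a shared library. The interposer below is loaded with LD_PRELOAD; offload_dgemm is a hypothetical stand-in for whatever routine a coprocessor vendor's library actually provides, and the hidden Fortran string-length arguments some compilers append are omitted for brevity.

    /* Hypothetical BLAS interposer: build as a shared library, e.g.
     *   cc -shared -fPIC interpose.c -o libinterpose.so -ldl
     * and load it with LD_PRELOAD so that calls to dgemm_ are intercepted.
     * offload_dgemm() is a placeholder for a vendor accelerator routine. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    typedef void (*dgemm_fn)(const char *, const char *,
                             const int *, const int *, const int *,
                             const double *, const double *, const int *,
                             const double *, const int *,
                             const double *, double *, const int *);

    /* Placeholder: the accelerator library's matrix multiply. */
    extern void offload_dgemm(const char *, const char *,
                              const int *, const int *, const int *,
                              const double *, const double *, const int *,
                              const double *, const int *,
                              const double *, double *, const int *);

    void dgemm_(const char *transa, const char *transb,
                const int *m, const int *n, const int *k,
                const double *alpha, const double *a, const int *lda,
                const double *b, const int *ldb,
                const double *beta, double *c, const int *ldc)
    {
        static dgemm_fn host_dgemm = NULL;         /* original CPU routine */
        if (host_dgemm == NULL)
            host_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");

        if (*m >= 512 && *n >= 512 && *k >= 512)   /* large enough to be worth offloading */
            offload_dgemm(transa, transb, m, n, k,
                          alpha, a, lda, b, ldb, beta, c, ldc);
        else                                       /* small problem: stay on the host */
            host_dgemm(transa, transb, m, n, k,
                       alpha, a, lda, b, ldb, beta, c, ldc);
    }

Run an unmodified binary as LD_PRELOAD=./libinterpose.so ./myapp and, under these assumptions, the large matrix multiplications move to the accelerator with no recompilation, which is exactly the invisible-coprocessor behavior described above.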
Power-efficient design
What is it about scientific computing that we can exploit to increase speed and save power? Sometimes we see studies at the machine instruction level that show business applications and scientific applications looking very much alike in their requirements, but that is a fallacy. The apparent similarity is merely the result of forcing both kinds of applications to run on identical hardware. In addition, the significant differences in how each application stresses the memory system are difficult to glean from a simple survey of the op codes. Scientific algorithms are profoundly different from the kinds of algorithms used for, say, compiling C++ code or maintaining databases.
Scientific computing tends to have memory access patterns that are static and known at compile time. In solving equations, computing forces, or iterating to converge a result, the memory access pattern usually does not depend much on the actual data values. That means we can minimize the control logic that takes up a very large fraction of the processing hardware, which saves power and space. Techniques like “scout threads” and “speculative execution” have been powerful ways to keep old serial-thread programs improving for a few more years, but all that guessing uses a lot of wattage, especially when the guesses turn out to be wrong.
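As a concrete illustration (mine, not drawn from any particular code), compare the two C loops below. In the first, the compiler and the hardware can see the entire access pattern (a fixed stride over neighboring elements) before any data arrives; in the second, every address depends on data that must be read first, so the hardware has to guess ahead or wait.

    #include <stddef.h>

    /* Access pattern known before the data arrives: every iteration touches
     * a[i-1], a[i], a[i+1] at a fixed stride, whatever the values are. */
    void smooth(const double *a, double *b, size_t n)
    {
        for (size_t i = 1; i + 1 < n; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
    }

    /* Access pattern that depends on the data: the address table[idx[i]]
     * is unknown until idx[i] has been read, so the hardware must predict
     * it or stall. */
    void gather(const double *table, const size_t *idx, double *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }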
Another thing we can save on is automatic caching. Scientific computing certainly needs local memory that is faster than main memory, but we can manage it explicitly, since we know what has to go where. Automatic caching brings in an entire cache line even when an array access pattern repeatedly needs only one word of that line, so it can do more harm than good to scientific code performance. Dragging all those bytes to and from main memory wastes electricity, even if the architecture fully supports the wide transfers.
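Here is a rough sketch of what managing local memory explicitly can look like at the programming level. The local[] array is only a stand-in for an on-chip scratchpad, and the sizes are arbitrary; the point is that the programmer stages exactly the words the computation needs, rather than letting a cache speculatively haul in a full line around each one.

    #include <stddef.h>

    #define TILE 256   /* size of the explicitly managed local buffer */

    /* Sum every stride-th element of a large array (stride >= 1).  A hardware
     * cache would pull in a full line to deliver each single word.  Here the
     * program copies only the words it needs into a small buffer that stands
     * in for an explicitly managed local memory, then works out of it. */
    double strided_sum(const double *x, size_t n, size_t stride)
    {
        double local[TILE];      /* stand-in for on-chip scratchpad memory */
        double sum = 0.0;
        size_t count = n / stride;

        for (size_t base = 0; base < count; base += TILE) {
            size_t chunk = (count - base < TILE) ? (count - base) : TILE;

            /* Stage exactly the words the computation needs, nothing more. */
            for (size_t j = 0; j < chunk; j++)
                local[j] = x[(base + j) * stride];

            for (size_t j = 0; j < chunk; j++)
                sum += local[j];
        }
        return sum;
    }

On hardware with a genuine scratchpad, the staging loop would typically become a single block transfer of just those words.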
Just how different general-purpose computing is from scientific computing becomes apparent when we consider multi-core architectures. For general computing, we need designs that share memory with guaranteed cache coherency and that allow each core to run instruction streams independently of the others. In contrast, for scientific computing, you can usually distribute the data structures in a planned way, so the hardware can be much simpler and more power-efficient. Furthermore, single-instruction, multiple-data (SIMD) control works very well for the kinds of array constructs common in scientific programs, at least up to a few dozen processors. Having to manage separate instruction streams adds to the application programmer's burden, consumes a great deal of space on the chip, and burns more electric power.
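For a feel of what SIMD control over an array construct means in code, here is a minimal sketch using the standard OpenMP simd directive: one instruction stream describes the loop, and the compiler is free to apply each operation across many data elements at once.

    #include <stddef.h>

    /* Classic DAXPY: y = a*x + y.  One instruction stream describes the whole
     * loop; "#pragma omp simd" tells the compiler it may map the iterations
     * onto parallel SIMD lanes instead of separate instruction streams.
     * Compile with -fopenmp or -fopenmp-simd; otherwise the pragma is ignored. */
    void daxpy(double a, const double *x, double *y, size_t n)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }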
Specialized graphics processors for scientific jobs
Graphics processors (GPUs) are even more restricted in the types of workloads they process and, hence, can specialize even more than scientific coprocessors. For instance, the floating-point operations they perform require only single precision and can get away with the crudest form of rounding (truncation toward zero) to save space on the chip. The streaming nature of graphics operations also allows for a design in which data does not stay around for very long, so GPUs need little memory and not much sophistication in how they access it.
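A small illustration, mine rather than the article's, of why precision that is good enough for pixels is risky for long-running scientific arithmetic: accumulating ten million copies of 0.1 in single precision drifts far from the exact answer, while double precision stays essentially exact.

    #include <stdio.h>

    int main(void)
    {
        float  fsum = 0.0f;
        double dsum = 0.0;

        /* Add 0.1 ten million times; the exact answer is 1,000,000. */
        for (long i = 0; i < 10000000L; i++) {
            fsum += 0.1f;
            dsum += 0.1;
        }

        printf("single precision sum: %f\n", fsum);  /* lands far from 1,000,000 */
        printf("double precision sum: %f\n", dsum);  /* correct to within about a thousandth */
        return 0;
    }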
Just as it would be a mistake to try to use a scientific coprocessor to run standalone Linux or to process real-time transactions, it would be a mistake to use coprocessors designed for graphics as engines for physical simulations and serious engineering tasks. Some graphics coprocessor vendors are talking about having it both ways, adding more control and precision to their GPUs (making them GPGPUs) so that they can play in both the graphics and scientific markets. But it's the usual story: doing so adds cost and power consumption for features that their main customer base doesn't need and certainly doesn't want to pay extra to support.
Summary
The trend is toward special chips for specific purposes. We used to take this approach to save transistors. As transistors became less expensive, we instead combined all functions into a single, general-purpose commodity processor. Today, the limit on scientific computing capability is coming from a very different direction: power consumption and heat dissipation. Thus, we are rediscovering the advantages of the hybrid approach where systems contain optional coprocessors that let us make the best possible use of every kilowatt-hour on our power bill.
Coprocessors for scientific computing are of the visible type now. However, as they evolve, we will take them for granted as accelerators that run automatically in the background, speeding fundamental kernel operations just as graphics processors speed fundamental video processing tasks today.
John Gustafson is Chief Technology Officer for HPC at ClearSpeed Technology. He may be reached at [email protected].