Big challenges lie on the software side in efficiently corralling hardware’s runaway parallelism
Three real-world applications have broken the petaflop barrier, all on the Cray “Jaguar” supercomputer. However, the vast majority of HPC applications were originally written to be run on a single processor with direct access to main memory. Courtesy of the National Center for Computational Sciences, Oak Ridge National Laboratory.
At an IDC HPC User Forum meeting in 2005, Paul Muzio, director of the HPC Center at City University of New York, said that an application engineer’s ideal petascale supercomputer would have one powerful processor with direct access to main memory. By contrast, a computer scientist’s ideal petascale system would have hundreds of thousands of processors of different types and an innovative, distributed memory architecture.
“Unfortunately for many users,” he added, “the computer scientist’s system may be the one that’s built in the near future, as the technology to do otherwise does not exist. The challenge will be to build this kind of system but make it look like the kind the application engineer wants.”
Fast forward five years to today, and Paul Muzio’s expectation has been borne out. The first petascale supercomputers, now deployed in the U.S., Germany and China, each employ more than 100,000 processor cores, multiple processor types in most cases, and distributed memory architectures.
Where do HPC applications stand in relation to all this potential processing power? To date, three real-world applications have broken the petaflop barrier (10^15 calculations/second), all on the Cray “Jaguar” supercomputer at the Department of Energy’s Oak Ridge National Laboratory. A slightly larger number have surpassed 100 teraflops (a teraflop is 10^12 calculations/second), mostly on IBM and Cray systems, and a couple of dozen additional scientific codes are being groomed for future petascale performance. All of these applications are inherently parallel enough to be laboriously decomposed (sliced and diced) for mapping onto highly parallel computers.
But the vast majority of HPC applications were originally written to be run on, you guessed it, a single processor with direct access to main memory. Most codes have not progressed far beyond that and are only modestly scalable today. An IDC survey showed that more than half of the codes (56 percent) do not scale beyond eight processing elements, and only about six percent can exploit more than 128. That’s a tiny fraction of the hundreds of thousands of cores on today’s biggest supercomputers, not to mention the millions of cores their successors are expected to contain in just a few years.
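One common way to reason about why a code stops scaling, though not one the survey itself invokes, is Amdahl’s law: whatever share of the work stays serial caps the achievable speedup no matter how many cores are added. The sketch below is illustrative only. The 90 percent parallel fraction is a hypothetical figure, not a number from the survey, and the processor counts are simply the ones mentioned above.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup under Amdahl's law.

    parallel_fraction is the share of runtime that can be spread across
    processors; the remainder is assumed to run serially.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Hypothetical code in which 90 percent of the work parallelizes perfectly.
PARALLEL_FRACTION = 0.90

for n in (8, 128, 100_000):
    speedup = amdahl_speedup(PARALLEL_FRACTION, n)
    print(f"{n:>7} processors: speedup {speedup:.2f}x")
```

Under these assumptions the speedup can never reach 10x, however large the machine, which is one reason serious algorithm work, rather than more hardware, is needed to move such codes forward.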
There are other issues with today’s multicore HPC systems:
• First, to save energy and control heat, more multicore processors, on desktop computers and supercomputers alike, tend to have tuned-down speeds. Instead of the 3 to 4 GHz single-core processors of the past, it’s increasingly common to see multicore processors in which each core runs at about half that speed. This presents problems, especially for applications that can only exploit one or a few cores, as is not uncommon in the design engineering sector. In yet another IDC survey, 12 percent of the HPC sites said some of their applications were running more slowly on their newest HPC system than on the prior one. More than half of the respondents said they expected to encounter retrograde performance like this on some applications within the next 12 months.
• Second, clusters based on standard x86 processors from Intel and AMD are the dominant species of HPC systems, capturing about two-thirds of worldwide HPC server revenue in 2009. But as standard x86 processors have gone from one to two cores, and then from two to four and beyond, they have multiplied their peak performance (flops) without corresponding increases in memory bandwidth, even though this situation has improved somewhat recently. The relatively poor bytes-per-flop ratio of x86 processors has limited cluster efficiency and productivity by making it more and more difficult to move data into and out of each core fast enough to keep the cores busy. Adding accelerators via a slow PCI bus adds another dimension to the balance problem. Both hardware and software advances are needed to improve sustained performance on future multicore, x86-based cluster systems.
• Finally, massive hardware parallelism from skyrocketing core counts and system sizes has raced ahead of programming paradigms and the time available to programmers. Alternative “manycore” processors, such as general-purpose GPUs, are adding to the programming challenge. This parallel performance “wall” will reshape the nature of HPC code design and system usage.
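The balance problem described in the second point can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-core peak rate and the memory bandwidth are hypothetical round numbers, not measurements of any real processor. It shows how the bytes available per flop shrink as cores are added to a processor whose bandwidth stays fixed.

```python
# Illustrative only: every hardware number here is a hypothetical round
# figure, not a measurement of any real x86 processor.

def bytes_per_flop(cores, gflops_per_core, mem_bandwidth_gbs):
    """Bytes of memory traffic available per peak floating-point operation."""
    peak_gflops = cores * gflops_per_core
    return mem_bandwidth_gbs / peak_gflops


GFLOPS_PER_CORE = 10.0    # assumed peak rate of one core
MEM_BANDWIDTH_GBS = 20.0  # assumed memory bandwidth, held constant

for cores in (1, 2, 4, 8):
    ratio = bytes_per_flop(cores, GFLOPS_PER_CORE, MEM_BANDWIDTH_GBS)
    peak = cores * GFLOPS_PER_CORE
    print(f"{cores} cores: {peak:3.0f} peak Gflops, {ratio:.2f} bytes/flop")
```

With these assumed figures, peak flops grow eightfold from one core to eight while the bytes-per-flop ratio falls by the same factor, so each core has proportionally less data arriving to keep it busy.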
Given this situation, it seems almost inevitable that a fair number of HPC applications will need to be fundamentally rethought and rewritten within the next five to 10 years to take better advantage of highly parallel supercomputers. That means serious algorithm development, because we’ve already seen examples of mathematical models and algorithms that broke when pushed beyond their design limits. There aren’t enough people on Planet Earth with the right kind of brainpower and experience to tackle this, but there are some. We’ll need to motivate universities to produce more.
Partitioned global address space (PGAS) programming languages have existed for over a decade and could be far more productive than the Message Passing Interface (MPI) that dominates HPC programming today. But few HPC users are ready to learn a new language that would also require rewriting HPC applications that could contain tens or hundreds of thousands of lines of code.
Yet, revolutionary changes such as rewriting applications and adopting new programming languages may still be on the horizon, because the evolutionary path is not addressing the fundamental problems users face, most notably the Herculean task of writing code for the largest systems that may soon have a million or more cores.
In sum, the hardware direction in HPC seems set for the foreseeable future, with hard-to-exploit commodity-based systems now the rule. These systems have increasingly conquered the HPC market by offering irresistible price/performance, and this has helped make HPC pervasive enough that it is used today not just by government and university researchers but to design products ranging from cars and planes to golf clubs, microwave ovens, animated films, potato chips and diapers.
The biggest challenges lie on the software side — how to create algorithms, system and management software, programming tools and applications that efficiently corral the runaway parallelism of the hardware. This provides new opportunities for nations and global regions, including software powerhouse Europe, to advance to leadership positions in the worldwide HPC market.
Steve Conway is Research VP, HPC at IDC. He may be reached at editor@ScientificComputing.com.