Starting almost 10 years ago, the scientific and engineering community has been identifying the advances that exascale computing might enable, as well as the challenges involved in building and using affordable exascale computers. The challenges have, in particular, led to a great deal of anxiety and predictions that most of the current HPC ecosystem will have to be discarded and completely new solutions will have to be crafted for the hardware components, system architecture, programming languages and models, algorithms, and operating systems.
Having observed and participated in scientific computing for slightly over five decades (51 years to be precise, but five decades sounds more impressive, doesn’t it?), I have been mulling over whether the transition from petascale to exascale computing capability will be dramatically different and more difficult to achieve than previous transitions. Yes, being of advanced age, I am vulnerable to being accused of embellishing the hardships that we faced in previous generations; so be it. At least I will not say that I had to walk 10 miles in the snow to get to my elementary school.
My first observation (which, admittedly, I first talked about a decade ago) is that the evolution of scientific computers and computing has not been linear. Every so often, advances in one or more technologies or architectures cause substantial, sometimes disruptive, change in the computing environment and affect how one programs and uses the systems efficiently. For a few years there is typically a flurry of activity to adapt to these changes, followed by a period of only gradual change.
This pattern is similar to the “punctuated equilibria” theory in evolutionary biology postulated in 1972 by N. Eldredge and S.J. Gould. A new species of computer appears only infrequently, but when it does, it typically brings rapid changes to applications, compilers, languages and programming models, as well as system architecture. My informal categorization of the transitions is:
- Serial to HPC serial, meaning that systems were designed to be effective for scientific calculations. As noted below, one example is the introduction of memory architecture features aimed at reducing data access times.
- HPC serial to vector. In addition to hardware architecture advances, vectorizing compilers played a major role in facilitating porting applications to vector machines.
- Vector to parallel. Here, many architectural flowers bloomed, a number of which are perennial:
- SIMD systems such as the ILLIAC IV, ICL DAP, and Thinking Machines CM-1 and CM-2
- MIMD: a plethora of variants too numerous to list but surely well-known to readers of Scientific Computing
- MIMD with vector processors as the building blocks
- VLIW-based systems
- Parallel to massively parallel: systems with thousands to hundreds of thousands of processors. These transitions can be thought of as having multiple stages, as the peak speed of the largest systems increased from hundreds of gigaflops to teraflops and petaflops.
My second observation is that the challenges that arise during the periods of rapid change are often the same, albeit at different scales. For example, I/O bandwidth that can keep up with the volume of data produced on today’s systems is a major challenge. But, as far back as the mid-1960s, when Seymour Cray designed the CDC 6600, he was aware that there was considerably more to performance than raw processor speed: I/O bandwidth had to be maximized as well, to avoid “starving” the processor of data to process or idling the system while results were written to disk. As he later noted, “Anyone can build a fast CPU. The trick is to build a fast system.” Likewise, memory bandwidth and size were issues. In 1969, the CDC 7600, also designed by Cray, had two tiers of memory, small core memory (SCM) and large core memory (LCM); SCM had an access time 10 times faster than LCM’s. IBM introduced cache memory in its System/360 Models 85 and 195. Argonne National Laboratory had a 360/195, and I recall my surprise when a user pointed out to me that he was able to do some matrix calculations 25 times faster by arranging the calculations in blocks that fit into the cache.
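The trick that user had discovered is what we now call cache blocking, or tiling, and it remains the standard remedy when data access times lag processor speed. Below is a minimal sketch in C of a blocked matrix multiply; the matrix size N and block size BLOCK are illustrative values of my own choosing, not figures from the 360/195 episode, and a real code would tune BLOCK to the cache at hand.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 512      /* matrix dimension (illustrative) */
#define BLOCK 64   /* block edge, chosen so a few BLOCKxBLOCK tiles fit in cache (assumption) */

/* Naive triple loop: strides through B column by column, so for large
 * matrices cache lines are evicted before they can be reused. */
static void matmul_naive(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

/* Blocked version: the same arithmetic, reordered to work on
 * BLOCKxBLOCK tiles that stay resident in cache while being reused. */
static void matmul_blocked(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N * N; i++)
        C[i] = 0.0;
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main(void)
{
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = malloc(sizeof(double) * N * N);
    for (int i = 0; i < N * N; i++) {
        A[i] = (double)(i % 7);
        B[i] = (double)(i % 13);
    }
    matmul_blocked(A, B, C);   /* swap in matmul_naive to compare timings */
    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```

The blocked loop performs exactly the same operations; only the order of the memory accesses changes, which is the kind of refinement, rather than wholesale replacement, that this essay argues for.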
Five system architecture and packaging challenges that I have seen arise repeatedly over the various transitions are memory size, data movement time, I/O, cooling and energy use. All five are among the primary challenges that are posed by exascale systems. Of course, that does not mean that addressing them will be simple just because similar ones have been tackled many times, but at least we can be optimistic and expect that solutions will be found without having to change everything.
In particular, energy efficiency has been identified as the number one challenge in building an affordable exascale system. I do not question the magnitude of the challenge, but I feel compelled to point out that energy efficiency was also considered a major obstacle in the studies of how to build petascale systems. A bit of background: in 1993, Dan Goldin, then NASA Administrator, requested the first study on petaflops. In June 1993, on the first-ever Top500 list, the fastest system was the Thinking Machines CM-5 at Los Alamos, with a peak speed of 131 gigaflops. In November 1993, the fastest system on the Top500 list was the Fujitsu Numerical Wind Tunnel, with a peak speed of 236 gigaflops. Goldin wisely wanted to aim not for teraflops but for petaflops, nearly 10,000 times faster than the fastest systems of 1993. In February 1994, the Workshop on Enabling Technologies for Petaflops Computing was held, and it was followed by a number of other workshops and analyses of the challenges in building petaflops systems. Back to energy: one of the subsequent NASA studies identified power as the biggest obstacle to petaflops because of its cost (Frank S. Preston, A Petaflops Era Computing Analysis, Computer Sciences Corporation, Hampton, Virginia, NASA/CR-1998-207652).
Application programming issues that have been present in various forms since at least the 1960s include portability, reimplementation of algorithms to exploit new architectural features and achieve the speedups they enable, and programmability. Examples of past portability issues include floating-point arithmetic units that implemented basic operations, such as division, differently; the IEEE arithmetic standard eliminated almost all of that variation. Another persistent issue is that different compilers implement language standards differently. Some had extensions that were not labeled as such and led programmers to write non-standard code unwittingly; when they started using a computer of a different brand, sometimes the code would not run. As an indication of the scale of the problem, Brian Smith wrote a useful article in 1976 titled “Fortran Poisoning and Antidotes” that described how to write portable Fortran. In the early 1990s, there were many variants of message-passing libraries; a few dozen members of the scientific computing community banded together and developed the MPI standard.
Another finding, from a 1995 analysis of petascale system challenges, concerns programmability: “The challenge of software technology — to achieve good performance from relatively easy-to-use systems — is formidable at the teraflops scale and looks nearly insurmountable at the petascale.”
A 1997 study of software for petaflops stated: “It is conceded by supercomputer system architects, designers and users that the algorithms now in use for computationally demanding jobs will need to be refined or replaced for the petaflops era and beyond. The same can be said for the operating systems and programming languages.”
Sound familiar? Actually, when speaking of exascale, many people leave out the option of “refined” and state that nearly everything must be replaced. My observation is that refinement sufficed in the transition to petascale.
I can’t resist noting one more example of a prediction that so far has not come to pass, namely, that MPI will never work at one million ranks. A couple of years ago, full applications ran efficiently using three million ranks, and a fine-grain variant of MPI (FG-MPI) was able to run 100 million MPI ranks on 6,480 cores of a cluster. (FG-MPI implements MPI processes as coroutines rather than regular OS processes.) Which brings me to my final observation in this essay: it is important to distinguish algorithms, methods, programming models, et cetera, from their implementations. In many cases, it is possible to adapt the implementation of familiar and proven tools so that they perform well on the next generation of HPC systems.
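To make the model-versus-implementation distinction concrete, here is a minimal MPI program in C. It speaks only of ranks and messages; whether the library beneath it maps each rank to an operating-system process (as mainstream implementations do) or to a coroutine (as FG-MPI does) is invisible at this level. The ring exchange is my own illustrative choice, not taken from the applications mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

/* A minimal MPI program: the source is written against the programming
 * model (ranks, messages); how the implementation realizes the ranks is
 * not visible here. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("running with %d ranks\n", size);

    /* Simple ring exchange: each rank sends its number to its neighbor. */
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    int recv_val;
    MPI_Sendrecv(&rank, 1, MPI_INT, next, 0,
                 &recv_val, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```

A source file like this compiles with mpicc and launches with mpiexec -n under any conforming implementation; scaling to millions of ranks was achieved by refining the implementations underneath the model, not by replacing the model itself.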
In summary, in my opinion, many challenges posed by technology trends are not fundamentally different from those posed by previous transitions.
Instead of wringing our hands and claiming we need to abandon our current technologies (hardware and software writ large), we should examine carefully the software ecosystem that works and is widely used today and then decide what to throw out, what to refine/adapt, and what may require inventions.
Paul Messina is Director of Science at the Argonne Leadership Computing Facility.