We computational chemists are an impatient lot.
Despite the fact that we routinely deal with highly complicated chemical processes running on our laboratory’s equally complex high performance computing (HPC) clusters, we want answers in minutes or hours, not days, months or even years. In many instances, that’s just not feasible; in fact, there are times when the magnitude of the problem simply exceeds the capabilities of the HPC resources available to us.
A major step toward dealing with these issues occurred in 1997 with the formation of the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) at Pacific Northwest National Laboratory (PNNL) in Richland, WA. EMSL is a DOE Office of Science national scientific user facility sponsored by the Office of Biological and Environmental Research.
EMSL supports researchers who are developing innovative solutions to environmental and energy-production challenges in areas such as atmospheric aerosols, feedstocks, global carbon cycling, biogeochemistry, subsurface science and energy materials.
The lab’s computational capabilities include a supercomputer with petaflop performance. This HPC system has been designed specifically to solve large computational chemistry and biochemistry problems that combine computation with experiment and theory. To support the supercomputer, PNNL developed the Global Arrays toolkit, a programming interface that enables users to program both homogeneous and heterogeneous massively parallel systems.
A lot of other sophisticated equipment is available as well — for example, a suite of mass spectrometers, and an extensive collection of surface characterization and deposition instruments. EMSL also has a variety of subsurface flow and transport tools to study the characteristics of environmental contaminants.
Of particular interest to computational chemists is NWChem, leading-edge software that offers a wide variety of highly scalable, computational chemistry methods. Having these kinds of capabilities available to researchers has fundamentally changed how we go about the work of computational chemistry. We are now able to investigate more complex structures and reactions at both the quantum and classical levels, and to build more complex, detailed and accurate computational models.
But despite these advances, we still face a number of challenges.
For example, the widespread use of highly correlated methods in electronic structure calculations depends on the interplay between advances in theory and the possibility of using the constantly growing computer power of emerging heterogeneous HPC architectures. This represents a great opportunity, but also a great challenge to those of us working in the field.
Before this kind of computational power became available, running complex simulations on HPC systems based on CPUs alone was feasible but, in some instances, took an inordinate amount of time. Of course, employing coprocessors significantly reduced the time to solution.
An even more fundamental problem is the fact that, in the past, computer codes were not designed to take advantage of supercomputers or powerful HPC clusters with hundreds of thousands or millions of processors and a high degree of parallelism. The process of rewriting legacy software to run efficiently on these new architectures — a process sometimes known as code modernization — is in full swing today.
This is not a process that will happen overnight. Updating essential scientific applications to take full advantage of these systems’ parallelism and scalability is difficult and time-consuming work that can be overlooked or shunted aside by the crush of other priorities. In some cases, it will be dropped in the lap of the domain expert who, although highly knowledgeable in his or her field, does not have the background, time or resources to optimize applications for today’s complex, highly parallel, multicore HPC systems.
At EMSL, harnessing the power of these new heterogeneous architectures is part of our charter to advance our explorations in computational chemistry.
Our primary computing system for these endeavors is Cascade, a petaflop-class, Linux-based cluster with more than 23,000 Intel Xeon processor cores and 184,000 gigabytes of memory, roughly four times more memory per core than many comparable supercomputers.
Its specifications include:
- Theoretical peak processing speed of 3.4 petaflops with an actual Rmax Linpack performance of 2.5 petaflops
- 1440 compute nodes with Intel Xeon processors plus Intel Xeon Phi coprocessors
- 23,040 Intel Xeon processor cores
- 195,840 Intel Xeon Phi coprocessor 5110P cores
- 128 GB memory per compute node (8 GB per Intel Xeon processor core)
- Mellanox FDR InfiniBand network
- 2.7-petabyte Lustre parallel filesystem (DDN EXAScaler) on DDN SFA12k storage
EMSL has developed software that exploits the massive parallelism and processing power of Cascade. Two software packages in particular allow scientists to understand complex chemical systems at the molecular level, enabling research that was never before possible:
- Northwest Computational Chemistry Software (NWChem) — NWChem is molecular modeling software developed to take full advantage of a wide range of parallel computing systems, including Cascade. This highly scalable software provides a variety of methods to compute the properties of molecular and periodic systems by using standard quantum-mechanical descriptions of the electronic wavefunction or density. NWChem also functions at the classical level, allowing scientists to perform mixed quantum mechanics and molecular mechanics simulations.
- Global Arrays (GA) toolkit — GA forms the high-performance, efficient and portable software layer that allows NWChem to run on a wide variety of parallel computing systems with leading-edge performance and scalability. The Global Arrays toolkit can be seen as an efficient and portable “shared-memory” programming interface for distributed-memory computers; a minimal sketch appears just after this list. The EMSL group also relied on GA to reach the desired performance on the combination of Intel Xeon processors and Intel Xeon Phi coprocessors while keeping software maintenance manageable, without the extensive rewrites that a GPU port would have required.
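To give a flavor of the GA programming model, here is a minimal sketch in C. The array name, dimensions and memory-allocator sizes are illustrative only, and build details vary by GA installation; the point is that any process can read or write any patch of a logically shared array with one-sided calls, without the owning process participating.

```c
/*
 * Minimal Global Arrays sketch (C interface): every MPI rank sees one
 * logical 2-D array even though its storage is distributed across nodes.
 * Array name, dimensions and MA stack/heap sizes are illustrative.
 */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);      /* local buffer space used by GA */

    int dims[2] = {1000, 1000};
    int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);   /* distributed array */
    GA_Zero(g_a);

    /* One-sided access: any rank may read or write any patch without the
       owning rank taking part in the call. */
    if (GA_Nodeid() == 0) {
        double block[10][10] = {{0.0}};
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        NGA_Put(g_a, lo, hi, block, ld);
    }
    GA_Sync();                             /* make the update globally visible */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```

This is the abstraction NWChem builds on: its large data structures live in global arrays and are addressed by index, rather than through explicit message passing.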
More about NWChem
NWChem is a general-purpose computational chemistry code designed specifically to deal efficiently with large scientific problems. It is used by thousands of scientists worldwide to investigate chemical processes by applying theoretical techniques and complex simulations to predict their structure, properties and reactivity. Chemical and biological species under investigation can vary in size from tens to thousands of atoms. This includes addressing reactive chemical processes occurring in our everyday environment — for example, photosynthesis, protein functions, aerosol formation and catalysis.
The software can be implemented on parallel computer resources ranging from conventional workstation clusters to high performance supercomputers. Part of the on-going development work at EMSL is to equip NWChem to efficiently use heterogeneous, multicore HPC systems with peak performance running well into the petaflops.
The core functionality of NWChem is a computational chemist’s delight. Among its attractions are many-body high-accuracy methods; density functional theory methods for both plane-wave and Gaussian basis sets; molecular dynamics; and various multi-scale methods.
The developers of the software realized early on that scaling its many functions required an inherently different programming model from what the typical message-passing solution offered. The result was the Global Arrays toolkit. This programming model supports scalability and fault tolerance, allowing NWChem to run on a variety of supercomputers with a minimum of coding changes to handle different architectures.
Global Arrays provides the underlying infrastructure that makes NWChem one of the fastest and best-scaling tools available for large machines. The infrastructure allows us to focus on particular, desired properties of molecular systems, and our theoretical calculations can be used to drive experiments.
Specifically, the software provides a variety of methods for determining molecular electronic structure, molecular dynamics and pseudopotential plane-wave electronic structure.
Included are:
- Molecular electronic structure methods — A class of methods for determining energies and analytic first derivatives with respect to atomic coordinates (Hartree-Fock is one of those methods).
- Quantum mechanics/molecular mechanics — QM/MM provides a comprehensive set of capabilities to study ground and excited state properties of large molecular systems.
- Pseudopotential plane-wave electronic structure — This module performs DFT (Density Functional Theory) calculations to create simulations in the study of complex molecular, liquid and solid-state systems.
- Molecular dynamics — The MD module performs classical simulations using the AMBER and CHARMM force fields.
Making full use of these and other advances in both computational theory and tools depends on using the rapidly growing computer power of heterogeneous architectures. At EMSL we have been working to demonstrate the feasibility of adopting the Intel Xeon Phi coprocessor, based on the Intel Many Integrated Core (MIC) architecture, in order to develop efficient computational chemistry modeling tools. We have realized remarkable scalability in our benchmarks: our solution scales up to a total of 62,560 cores with the concurrent utilization of Intel Xeon processors and Intel Xeon Phi coprocessors.
NWChem at Work
To illustrate the diversity of the research supported by NWChem, here are just a few of the many avenues of investigation underway at EMSL:
- quantum mechanical method for describing the response properties of molecules adsorbed on metal nanoparticles
- ground-state DFT calculations using atom-centered Gaussian-type orbitals to study molecules, finite clusters and nanosystems
- use of TDDFT (time dependent density functional theory) to conduct reliable excited-state calculations involving single excitations in a wide range of systems spanning molecules to materials
- reliable characterization of excitation energies and excited-state potential energy surfaces with high-order Equation-of-Motion Coupled Cluster (EOMCC) formalisms
Exploring NWChem and Parallel Computing
Because computational chemistry increasingly relies on the capabilities offered by the latest heterogeneous HPC systems, a team of EMSL researchers recently investigated the impact of using advanced coprocessors in the context of the NWChem computational chemistry package. In this case, we examined the impact of the Intel MIC architecture and the Intel Xeon Phi coprocessor on the implementation and performance of the highly accurate CCSD(T) quantum chemistry method.
Computational chemistry requires accuracies that can only be obtained by using a limited class of methods that can determine the instantaneous interactions between electrons or correlation effects in molecules. These high levels of accuracy are essential to bridge the gap between theory and experiment. Coupled Cluster (CC) formalism has the advantage over other methods because the correlation effects are very efficiently encapsulated in the exponential form of the wavefunction. This feature is essential for describing complex chemical processes.
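In standard notation, the CC ansatz writes the correlated wavefunction as an exponential of cluster operators acting on a single reference determinant:

\[ |\Psi_{\mathrm{CC}}\rangle = e^{T}\,|\Phi_0\rangle, \qquad T = T_1 + T_2 + T_3 + \cdots \]

CCSD retains the single- and double-excitation operators T1 and T2; the “(T)” in CCSD(T) denotes a non-iterative, perturbative estimate of the triple-excitation contribution added to the converged CCSD energy.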
Given the importance of the CCSD(T) formalism, we have made significant progress in developing codes that can fully exploit the processing power of large-scale parallel computers. The NWChem code now includes several distinct CCSD(T) implementations.
The first is based on the Kobayashi-Rendell algorithm, which is used for the study of closed shell molecules (molecules without unpaired electrons). Because of the inherent parallelism of the non-iterative methods used, the CCSD(T) implementation was able to scale across 250,000 cores.
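Those independent work units lend themselves to dynamic load balancing through a shared global counter that each process atomically increments to claim its next block, a pattern Global Arrays supports directly. The C sketch below illustrates the idea only; it is not the production NWChem scheduler, and compute_triples_block() is a hypothetical stand-in for the real kernel.

```c
/*
 * Simplified sketch of dynamic load balancing for the independent work
 * units of the non-iterative (T) correction, using a shared Global Arrays
 * counter. Illustrative only; not NWChem's production code.
 */
#include "ga.h"

static void compute_triples_block(long block) {
    (void)block;   /* placeholder for the real floating-point work */
}

void triples_driver(long num_blocks) {
    int one = 1, zero = 0;
    int g_counter = NGA_Create(C_LONG, 1, &one, "task counter", NULL);
    GA_Zero(g_counter);
    GA_Sync();

    /* Each process atomically claims the next unprocessed block index. */
    long task = NGA_Read_inc(g_counter, &zero, 1);
    while (task < num_blocks) {
        compute_triples_block(task);
        task = NGA_Read_inc(g_counter, &zero, 1);
    }

    GA_Sync();
    GA_Destroy(g_counter);
}
```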
The second implementation was based on the Tensor Contraction Engine (TCE) approach, which also demonstrated excellent scalability. When compared to the closed-shell CCSD(T) implementation, the TCE CCSD(T) code has a wider applicability, since it uses a spin-orbital formalism that can deal with closed- and open-shell systems. This generates a physical model of increased complexity, which, in turn, results in more complicated algebraic formulations with higher computational requirements.
The reason for this complexity is that scientists use elaborate mathematical objects called tensors, which contain tens of millions to hundreds of billions of elements. The modeling process involves dozens to hundreds of manipulations called tensor contractions, which are quite complex and hard to program efficiently.
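To make the term concrete, a single tensor contraction is simply a multiply-and-sum over shared indices. The toy C sketch below shows one such contraction; the index names and the dimension N are purely illustrative, and real coupled-cluster codes work on blocked, distributed tensors through automatically generated kernels.

```c
/*
 * Toy tensor contraction: r(a,b,i,j) += sum_c f(a,c) * t(c,b,i,j).
 * Dimensions and index names are illustrative only.
 */
#define N 16

void contract(double r[N][N][N][N],
              const double f[N][N],
              const double t[N][N][N][N]) {
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < N; ++b)
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    double sum = 0.0;
                    for (int c = 0; c < N; ++c)
                        sum += f[a][c] * t[c][b][i][j];  /* sum over the shared index c */
                    r[a][b][i][j] += sum;
                }
}
```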
TCE, which was designed at PNNL, takes advantage of the underlying structure of the algebraic equations to arrive at what is effectively the minimum number of operations required. By automatically generating the efficient but complex computer code that chemists, physicists and materials scientists need to model the structure and interaction of complete molecules, TCE saves these researchers weeks, even months, of work.
The TCE CCSD(T) algorithm has been ported to the Xeon Phi coprocessor with the aim of leveraging the attached coprocessors’ computational power while minimizing data movement between host and accelerator.
Implementing CCSD(T) on the Intel Xeon Phi coprocessor
When, several years ago, we added our first round of GPU capabilities to our HPC infrastructure, we had to turn to another group within PNNL that was familiar with the proprietary GPU development toolkit.
When it came time to replace the original GPUs with Intel hardware, we had an easier time of it. Because we were already familiar with Intel-based systems, we were able to handle the implementation with our EMSL team. In addition, we used the OpenMP programming model, which provided a relatively simple way to modify the code. As a result, development time was considerably reduced: it took us only about four months to produce a working version of the code, roughly half the time we had anticipated, and we were pleasantly surprised.
We collaborated with Michael Klemm, a Senior Application Engineer at Intel, who helped us develop the Intel MIC CCSD(T) code.
Performance of the CCSD(T) implementation
As our benchmark for this project, we decided to examine the performance of NWChem and the PNNL HPC system when computing the correction to the CCSD(T) correlation energy of the closed-shell pentacene molecule.
The hardware involved was Cascade, the EMSL/PNNL HPC system, whose nodes consist of dual-socket Intel Xeon E5-2670 processors running at 2.6 GHz with 128 GB of main memory. Each node also houses two Intel Xeon Phi 5110P coprocessors with 60 cores at 1.053 GHz and 8 GB of on-card memory. The benchmark was run on up to 460 nodes of the cluster, for a total of 7,360 host cores and 62,560 aggregate cores.
On the software side, we needed to keep a large fraction of the Fortran code, so we restricted ourselves to high-level code transformation. This not only improved performance on the coprocessor, but also proved to be beneficial for the host part of the system, providing general, overall performance gains.
We found that running our project on the cluster of Intel Xeon processors and Intel Xeon Phi coprocessors resulted in other performance boosts as well. For example, we found that carefully following a pre-determined procedure when setting up our runs for the heterogeneous system resulted in a 30 percent reduction in the time spent on data transfers.
During the course of this experiment, we discovered how incremental modifications to the floating-point intensive portions of a large and complex quantum chemistry code like NWChem can successfully exploit the computational power of the Intel Xeon Phi coprocessor. The adoption of a pragma-based offload model — which keeps most of the existing Fortran code unmodified — resulted in a focused porting effort, once the existing implementation was restructured to avoid unnecessary data transfers.
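NWChem’s port applies these ideas to Fortran, but the pattern is easiest to show in a compact C sketch using Intel’s offload extensions: mark the data that must move, keep buffers resident on the card across offloads to avoid repeated transfers, and let OpenMP spread the work over the coprocessor’s threads. The function names, sizes and scaling factors below are hypothetical.

```c
/*
 * Illustrative offload pattern (Intel offload extensions, C). NWChem's
 * actual port applies the equivalent directives to Fortran code.
 */
#include <omp.h>

__attribute__((target(mic)))
void scale_block(double *x, long n, double alpha) {
    /* OpenMP spreads the loop across the coprocessor's hardware threads. */
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        x[i] *= alpha;
}

void host_driver(double *x, long n) {
    /* First offload: allocate the buffer on the card, copy the data over,
       and keep it resident (free_if(0)). */
    #pragma offload target(mic:0) inout(x : length(n) alloc_if(1) free_if(0))
    scale_block(x, n, 2.0);

    /* Subsequent offloads reuse the resident copy, avoiding repeated
       host-to-coprocessor transfers. */
    #pragma offload target(mic:0) nocopy(x : length(n) alloc_if(0) free_if(0))
    scale_block(x, n, 0.5);

    /* Final offload: copy the result back and release the card-side buffer. */
    #pragma offload target(mic:0) out(x : length(n) alloc_if(0) free_if(1))
    scale_block(x, n, 1.0);
}
```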
The high-level loop transformations applied to the compute kernels improved both the coprocessor and the host performance without the need for architecture-specific, non-portable coding. Our performance results show that the compute power of a cluster of coprocessors can be used to apply simulation techniques that might be too costly on more conventional CPU-based clusters.
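A generic example of the kind of portable, high-level transformation we mean (not an actual NWChem kernel): collapsing two short loops into one long iteration space gives OpenMP enough work to balance across a 60-core coprocessor or a 16-core host node without any architecture-specific code.

```c
/* Before/after sketch of a portable loop transformation; the kernel and
 * dimensions are illustrative, not taken from NWChem. */
#include <omp.h>

void kernel_before(double *out, const double *in, int nt, int nb) {
    for (int t = 0; t < nt; ++t) {
        /* Parallelizing only the inner loop leaves little work per parallel
           region when nb is modest, so threads sit idle. */
        #pragma omp parallel for
        for (int b = 0; b < nb; ++b)
            out[t * nb + b] += in[t * nb + b] * in[t * nb + b];
    }
}

void kernel_after(double *out, const double *in, int nt, int nb) {
    /* Collapsing both loops exposes nt*nb iterations to the runtime,
       improving balance on host and coprocessor alike. */
    #pragma omp parallel for collapse(2)
    for (int t = 0; t < nt; ++t)
        for (int b = 0; b < nb; ++b)
            out[t * nb + b] += in[t * nb + b] * in[t * nb + b];
}
```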
In particular, the TCE CCSD(T) implementation, which uses a compute-intensive approach based on spin-orbitals, benefits from the added coprocessor processing power.
Moving Ahead
In the meantime, we will continue to optimize the system. For example, we plan to use asynchronous offloads and perform computation in the offloading thread on the host while the current offload is active on the target.
We will also investigate if data transfers can be further optimized by streaming data in chunks to and from the coprocessor to overlap data transfers, communication and computation. This would increase overall system utilization on all levels (interconnect, host CPU, coprocessor and PCIe bus).
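Roughly, the asynchronous pattern we have in mind looks like the following C sketch using Intel’s offload extensions: launch the offload tied to a signal variable, do host work while the card computes, then wait on the signal before consuming the result. The function names are hypothetical stand-ins.

```c
/* Sketch of an asynchronous offload overlapped with host computation.
 * Illustrative only; names are hypothetical. */
__attribute__((target(mic)))
void coprocessor_kernel(double *buf, long n) {
    for (long i = 0; i < n; ++i)
        buf[i] *= 2.0;                 /* placeholder for the offloaded kernel */
}

void do_host_work(void) {
    /* placeholder for host-side computation overlapped with the offload */
}

void overlapped_step(double *buf, long n) {
    int done;                          /* signal tag associated with this offload */

    #pragma offload target(mic:0) signal(&done) inout(buf : length(n))
    coprocessor_kernel(buf, n);

    do_host_work();                    /* runs on the host while the card computes */

    #pragma offload_wait target(mic:0) wait(&done)
    /* buf now holds the coprocessor's result */
}
```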
Finally, we are working on further optimizations of the offloaded kernels that will allow the coprocessor to provide even higher computational efficiency. We are also working towards extending the Intel MIC programming model to excited-state CC methodologies and multi-reference CC formulations.
Getting Better Results Faster
Overall, the implementation of a powerful HPC heterogeneous computational infrastructure and the on-going modifications and upgrades to NWChem are allowing us to dramatically cut our time-to-results from weeks and months to hours and days. We can build more complex and accurate computational models and tackle problems that were totally beyond our reach. And we are opening up new avenues of research that, in the past, we could not even imagine.
Computational chemistry will never be the same.
Karol Kowalski is Acting Technical Lead and Senior Research Scientist, and Edoardo Apra is Chief Scientist at WR Wiley Environmental Molecular Sciences Laboratory, MSC Performance Software. They may be reached at editor@ScientificComputing.com.