Modified NWChem Code Utilizes Supercomputer Parallelization

Molecular dynamics, quantum chemistry and quantum materials researchers use HPC resources to perform highly compute-intensive mathematical calculations and to simulate images of molecular structures. While HPC systems have been updated to increase processing speed and parallelization, many HPC codes are not optimized to take advantage of coming hardware advances, such as next-generation Intel Xeon Phi processors. This is the first in a series of articles, told from the perspective of the individuals who are updating the codes, that describes the changes various national laboratories and scientific research centers are making to improve code optimization and parallelization in order to reduce scientists’ time to discovery.

Quicker time to discovery. That’s what scientists focused on quantum chemistry are looking for. According to Bert de Jong, Computational Chemistry, Materials and Climate Group Lead, Computational Research Division, Lawrence Berkeley National Lab (LBNL), “I’m a computational chemist working extensively with experimentalists doing interdisciplinary research. To shorten time to scientific discovery, I need to be able to run simulations at near-real-time, or at least overnight, to drive or guide the next experiments.” Changes must be made in the HPC software used in quantum chemistry research to take advantage of advanced HPC systems to meet the research needs of scientists both today and in the future.


NWChem is a widely used open source computational chemistry software package that includes both quantum chemical and molecular dynamics functionality. The NWChem project started in the mid-1990s, and the code was designed from the beginning to take advantage of parallel computer systems. NWChem is actively developed by a consortium of developers and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. NWChem aims to provide its users with computational chemistry tools that scale both in their ability to treat large computational chemistry problems efficiently and in their use of available parallel computing resources, from high-performance parallel supercomputers to conventional workstation clusters.

“Rapid evolution of the computational hardware also requires significant effort geared toward the modernization of the code to meet current research needs,” states Karol Kowalski, Capability Lead for NWChem Development at PNNL.

Introducing Intel Parallel Computing Centers LBNL and PNNL

Both Pacific Northwest National Laboratory (PNNL) and Lawrence Berkeley National Laboratory (LBNL) are Intel Parallel Computing Centers (IPCCs) dedicated to optimizing NWChem code to effectively use the capability of the Intel Xeon Phi coprocessor. The main goal of the two teams is to optimize and modernize the NWChem code toward effective utilization of the emergent hybrid parallel computer systems based on the Intel Many Integrated Core Architecture (Intel MIC) technology to enable the scientific community to pursue new frontiers in the fields of chemistry and materials modeling.

The aim of the LBNL and PNNL work is to create an optimized version of NWChem that enables the scientific community to pursue new frontiers in the fields of chemistry and materials and climate modeling. LBNL research and development focuses on implementing greater amounts of parallelism in the codes, starting with simple modifications such as adding/modifying OpenMP pragmas and refactoring to enable effective vectorization for performance improvement, all the way to exploring new algorithmic approaches that can better exploit manycore architectures.
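As a rough illustration of the kind of "simple modification" described above, the sketch below adds a combined OpenMP threading and vectorization pragma to a loop. The dot_product kernel and its name are hypothetical and are not taken from NWChem, which is largely written in Fortran; the sketch only shows the general pattern of a unit-stride loop with a reduction.

```c
#include <stddef.h>

/* Hypothetical kernel (not from NWChem): dot product of two arrays.
   The restrict qualifiers and unit-stride access make the loop easier
   for the compiler to vectorize. */
double dot_product(const double *restrict x, const double *restrict y, size_t n)
{
    double sum = 0.0;

    /* Threads split the iterations and each thread's chunk is vectorized.
       The reduction clause gives every thread a private partial sum that
       is combined when the loop ends. */
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

If the compiler is run without OpenMP support, the pragma is ignored and the loop simply runs serially, which is one reason directive-based changes are a low-risk first step for legacy code.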

Catalytic materials used in conversion of cellulose to sugars and bio-oils

In-silico design is critical in accelerating the development of new catalysts and chemical reaction and transformation processes that tackle key scientific and engineering challenges in the production of sustainable products in efficient, environmentally friendly and cost-effective ways at industrial scale. Heterogeneous catalysis has a rich history of facilitating energy-efficient selective chemical transformations and contributes to 90 percent of chemical manufacturing processes. Catalysts are central to overcoming the engineering and scientific barriers to economically feasible routes for the conversion of biomass-derived and solar-mediated fuel and chemicals into usable products.

“An example is the conversion of cellulose into sugars and bio-oils, which through catalytic processes can be converted into biofuels or building blocks for industrial applications. Accurate simulations of the kinetics and thermodynamics of chemical transformations enable scientists to discover new and novel ways to predict, control and design optimal — industrially viable — catalytic activity and selectivity by rapidly scanning the large design space. It is crucial for the catalyst development process that compute-intensive computational chemistry simulations with NWChem run as efficiently on the next-generation computing platforms with the fastest time-to-solution possible,” states de Jong.

Figure 2: Wall time to solution (s) for the multi-reference coupled cluster methods. All tests were performed for the C20 system in the cc-pVDZ basis set (http://pubs.acs.org/doi/abs/10.1021/acs.jctc.5b00957).

Working to modernize NWChem code

PNNL is rewriting NWChem algorithms to take advantage of the new HPC architectures. Their work addresses the performance of several computational drivers in NWChem in these areas: Gaussian Density Functional Theory (DFT) methods, plane wave DFT formulations and multi-reference coupled cluster methods. The team is reworking codes with four major aims in mind:

improvement of the parallelization up to hundreds of thousands of cores
modifications of the intra-node parallelization by using threading approaches
a complete rewrite of the major-kernels in order to better exploit the wide vector units available in the present and upcoming computer architectures
reducing the computational cost for large scale ground and excited-state simulations

In particular, the PNNL Intel PCC project focuses on the development of Intel MIC implementations of the so-called multi-reference coupled cluster (MRCC) methods, which are considered one of the most accurate methods in quantum chemistry.

“These methods can be used to describe bond breaking processes, transition states in molecules, open-shell low-spin electronic states, and excited states. Through combining novel algorithms for harnessing combined power of Intel Xeon processors and Intel Xeon Phi co-processors with several levels of parallelism, the PNNL team has achieved a considerable (3X) speedup of the MRCC codes. The achieved progress will allow us to apply expensive CC methodologies to model challenging problems in catalysis and to describe important enzymatic reactions,” indicates Kowalski.

Figure 3: Run time results when running baseline OpenMP. Courtesy of LBNL

Using offloads, Intel LEO directives, OpenMP 4.0 to optimize code

The PNNL team is also using a directive-based offload model and OpenMP 4.0 in their work. The offload model keeps large parts of the existing Fortran code unmodified by inserting Intel Language Extensions for Offloading (LEO) constructs. This approach significantly reduces the time needed to develop and validate the code and, at the same time, provides tangible performance speedups from the Intel Xeon Phi coprocessors. While the Intel Xeon Phi coprocessor supports several offload programming models, each with unique properties, only the Intel LEO constructs (unfortunately a proprietary offload language that predates the target directives in OpenMP 4.0) provide the needed flexibility in transferring data and control while requiring only incremental changes to the existing Fortran source code.

Figure 4: Run time speedup results when running the optimized OpenMP code. Courtesy of LBNL

“Novel offloading algorithms developed in the last two years at PNNL not only helped to reduce time-to-solution of accurate methods, but also improved the scalability of the targeted implementations,” said Dr. Edoardo Aprà, the lead author of the Supercomputing 2014 paper reporting on the first Intel MIC implementation of the ubiquitous CCSD(T) formalism.

PNNL is currently using the OpenMP multithreading features for the parallelization of floating-point intensive computational kernels. They plan to convert the Intel LEO directives currently used for offloading to the target directives that are part of the OpenMP 4.0 standard.
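For readers unfamiliar with the OpenMP 4.0 target directives mentioned here, the following is a minimal sketch of what a directive-based offload can look like. It is written in C rather than the Fortran used in NWChem, and the scaled_add kernel is a hypothetical example, not PNNL code.

```c
/* Hypothetical kernel (not from NWChem): y = a*x + y, offloaded with
   OpenMP 4.0 target directives. */
void scaled_add(double a, const double *x, double *y, int n)
{
    /* map(to:) copies x to the device before the region runs;
       map(tofrom:) copies y to the device and back to the host afterward. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

As with the LEO approach described above, the loop body itself is unchanged; only directives are added. When no offload device is available, an OpenMP implementation typically runs the region on the host.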

NERSC User Community

LBNL works closely with the National Energy Research Scientific Computing Center (NERSC) and is undertaking an extensive outreach and education effort to ensure that the lessons learned in optimizing NWChem code are disseminated to the broader user community at NERSC. “The aim will be to supplement the training and outreach efforts NERSC is already undertaking to support its users on its current Intel Xeon processor-based Cray XC30 supercomputer ‘Edison.’ Additionally, the work will form part of the application-readiness efforts NERSC is undertaking as part of the expected delivery of its Intel Xeon Phi processor code name Knights Landing based Cori supercomputer in 2016,” states Nicholas J. Wright, Advanced Technologies Group Lead, National Energy Research Scientific Computing Center. To help prepare its large user base to run on the Cori supercomputer, NERSC launched the NERSC Exascale Scientific Application Program (NESAP) (http://www.nersc.gov/users/computational-systems/cori/nesap). Through NESAP, NERSC staff and postdocs are working with large application teams to optimize their codes, including NWChem.

Making modifications to accelerate simulations

The LBNL team is modifying and optimizing the NWChem code to more fully utilize features of the Intel Xeon Phi processor. The team is focused on end-to-end threading of the plane wave capability, the Hartree-Fock and Density Functional Theory (DFT) Fock build, and coupled cluster triples on Intel Xeon Phi processors. The work of the LBNL team is key to getting NWChem ready for the arrival of NERSC’s Cori supercomputer, which will feature the Intel Xeon Phi processor code-named Knights Landing.

“Our goal is to at least double the performance of computational chemistry and materials software, specifically NWChem, to accelerate scientific discovery. We want to use the Intel processors as efficiently as possible to get the fastest time-to-solution for the science problems we are trying to simulate using Intel processors,” states de Jong.

The LBNL team has found the integral computation to be poorly vectorizable and is having thread-safety issues with data structures present in the legacy code; they will be integrating other better vectorizable integral codes, and will explore implementing a new algorithm, using AVX-512 instructions, to improve their performance. They are also looking at using heterogeneous computing approaches, where different processors work on different operations, instead of simple OpenMP constructs to increase the amount of parallelism available to the processors. For the Fock matrix build in the Hartree-Fock or DFT algorithm, they achieved a 1.6x speedup running natively on the Intel Xeon Phi coprocessor.
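The article does not say how the LBNL team plans to assign different operations to different processors. One simple way to express the idea of threads working on different parts of a computation at the same time is shown below, here still within OpenMP, using sections. Both stage functions and their names are hypothetical stand-ins for independent pieces of a larger algorithm.

```c
/* Hypothetical stage 1: sum of squares of a vector. */
static double sum_of_squares(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        s += v[i] * v[i];
    }
    return s;
}

/* Hypothetical stage 2: largest absolute value in a vector. */
static double max_abs(const double *v, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; ++i) {
        double a = v[i] < 0.0 ? -v[i] : v[i];
        if (a > m) {
            m = a;
        }
    }
    return m;
}

/* Runs the two independent stages concurrently: each section may be
   executed by a different thread, so unrelated work overlaps instead of
   all threads sharing one loop. */
void run_independent_stages(const double *a, int na,
                            const double *b, int nb,
                            double *sq_out, double *max_out)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        *sq_out = sum_of_squares(a, na);

        #pragma omp section
        *max_out = max_abs(b, nb);
    }
}
```

This pattern only helps when the stages are truly independent and write to separate outputs, as they do here.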

LBNL achieved a 3.3x speedup for the triples correction in the coupled cluster algorithm using both the Intel Xeon processor and the Intel Xeon Phi coprocessor. The key to achieving this speedup was to optimize not just the expensive loops (for which the team integrated advances made by the joint PNNL-Intel efforts), but also to integrate OpenMP directives into all parts of the triples code (such as those that set up data structures and the summations that occur as part of the post-processing for the triples energy).

“Our latest work is on the plane-wave algorithm where the most time-consuming parts are the FFT (Fast Fourier Transform) and matrix multiplications on tall and skinny matrices (200×10,000, for example). Working together with PNNL, we have transformed the complete plane-wave algorithm to utilize OpenMP. Our work at LBNL has focused on the tall and skinny matrices. We developed a reduce algorithm for these special matrix multiplications using OpenMP teams and we achieved a 5.5x speedup over Intel’s MKL library on the Intel Xeon processor E5-2680 v3, formerly known as Haswell. We’re currently implementing this in our plane-wave algorithm and will be doing benchmarking on the whole code to assess the overall speedup,” states de Jong.
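The general idea behind a reduction-based multiply for tall and skinny matrices can be sketched as follows. This is not the LBNL implementation: it assumes the product has the form C = A^T B with a long shared dimension n, uses an ordinary OpenMP parallel region rather than OpenMP teams, and the function name tall_skinny_atb is hypothetical. Each thread accumulates a small private result over its share of the long dimension, and the private results are then summed.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch under stated assumptions: computes c = a^T * b, where a is n x m
   and b is n x k (row-major) and n is much larger than m and k, so c is a
   small m x k matrix. Returns 0 on success, -1 if an allocation failed. */
int tall_skinny_atb(const double *a, const double *b, double *c,
                    int n, int m, int k)
{
    int failed = 0;
    memset(c, 0, (size_t)m * (size_t)k * sizeof *c);

    #pragma omp parallel
    {
        /* Each thread gets its own small accumulator for the result. */
        double *partial = calloc((size_t)m * (size_t)k, sizeof *partial);
        if (partial == NULL) {
            #pragma omp atomic write
            failed = 1;
        }

        /* Rows of the long dimension are divided among the threads.
           Every thread must reach this loop, even after a failed
           allocation, so the NULL check sits inside the loop body. */
        #pragma omp for
        for (int row = 0; row < n; ++row) {
            if (partial != NULL) {
                for (int i = 0; i < m; ++i) {
                    for (int j = 0; j < k; ++j) {
                        partial[i * k + j] += a[row * m + i] * b[row * k + j];
                    }
                }
            }
        }

        /* Reduction step: add each thread's partial result into c. */
        if (partial != NULL) {
            #pragma omp critical
            for (int idx = 0; idx < m * k; ++idx) {
                c[idx] += partial[idx];
            }
            free(partial);
        }
    }
    return failed ? -1 : 0;
}
```

Because the output is small, the cost of the final reduction is minor compared with the work saved by letting every thread stream through its own slice of the long dimension.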

Hardware used by PNNL

The PNNL Cascade supercomputer has 1440 compute nodes, each with two eight-core 2.6GHz Intel Xeon E5-2670 processors (for a total of 23,040 cores). Each node has 128GB of memory. Additional computing power is provided by two Intel Xeon Phi coprocessor 5110P cards on each node. Each Intel Xeon Phi coprocessor has 60 cores running at 1.05GHz (for a total of 172,800 coprocessor cores) and 5GB of GDDR5 local memory. Communication between the nodes is provided by a Mellanox InfiniBand FDR interconnect. Storage is provided by a 2.7-petabyte DDN EXAScaler parallel filesystem running on DDN SFA12k storage.

How HPC will aid chemistry and materials research

The goal at PNNL is to optimize coding efforts in such a way that they will be easily available to integrate into other open-source computational chemistry software efforts.

PNNL, LBNL and Georgia Tech are working together on the project to update NWChem code. NWChem code optimized to run on advanced hardware, such as the Intel Xeon Phi coprocessor, will be reviewed and released so that it can be used by all NWChem users and researchers.

Scientists are currently restricted in what they can do in their research by the limitations of supercomputers or HPC software models. They often have to reduce their research models to make them fit within the HPC resources currently available. Researchers need to be able to use next-generation computers to run large computations faster in research areas such as solar cells, or to develop novel complex organic-inorganic materials for batteries, photovoltaics or thermoelectrics that can serve as sustainable energy sources.

“Many of our algorithms inherently do not have long vector loops that can be easily transformed using OpenMP. I am not sure OpenMP is the complete solution. We will have to develop models that utilize heterogeneous computing on nodes with different threads/cores working on different parts of the algorithm simultaneously,” states de Jong.

Modernization of underlying codes for efficient use of manycore architectures is required for benefits to be realized. According to Kowalski, “The rapid development of computer technology and emergence of petascale hybrid architectures offer an enormous increase in computational power, which will have a transformative effect on the whole area of computational chemistry, including the shift in the system-size limit tractable by many-body methods and opportunity of using accurate many-body techniques, which are characterized by steep numerical scaling.

The main goal is to enable accurate methods capable of describing complicated electron correlation effects, stemming from the instantaneous interactions between electrons in molecules. This includes two major classes of methodologies:

plane-wave formulations of the ubiquitous density functional theory
wavefunction-based methodologies utilizing the so-called coupled-cluster expansion

It is well-known that proper inclusion of these effects is indispensable for achieving the so-called chemical accuracy necessary to understand chemical transformations relevant to catalysis, biochemistry, geochemistry, and atmospheric chemistry, to mention only a few application areas.”

References

LBNL

H. Shan, S. Williams, W.A. de Jong, L. Oliker, “Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture”, in PMAM 2015: The 2015 International Workshop on Programming Models and Applications for Multicores and Manycores, pp 58-67 (2015), http://dx.doi.org/10.1145/2712386.2712391
H. Shan, B. Austin, W.A. de Jong, L. Oliker, N. Wright, E. Apra, “Performance Tuning of Fock Matrix and Two-Electron Integral Calculations for NWChem on Leading HPC Platforms”, in High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. Lecture Notes in Computer Science, 261-280 (2014), http://dx.doi.org/10.1007/978-3-319-10214-6_13
http://www.nersc.gov/users/computational-systems/cori/cori-phase-ii/

PNNL

M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Aprà, T.L. Windus, W.A. de Jong, NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations, Computer Physics Communications, Volume 181, Issue 9, September 2010, Pages 1477-1489, ISSN 0010-4655, http://dx.doi.org/10.1016/j.cpc.2010.04.018
Edoardo Aprà, Karol Kowalski, Jeff R. Hammond and Michael Klemm, Chapter 17 – NWChem: Quantum Chemistry Simulations at Scale, In High Performance Parallelism Pearls, edited by James Reinders & Jim Jeffers, Morgan Kaufmann, Boston, 2015, Pages 287-306, ISBN 9780128021187, http://dx.doi.org/10.1016/B978-0-12-802118-7.00017-0
E. Aprà and K. Kowalski, Implementation of High-Order Multireference Coupled-Cluster Methods on Intel Many Integrated Core Architecture, Journal of Chemical Theory and Computation, Article ASAP, http://dx.doi.org/10.1021/acs.jctc.5b00957
http://www.pnnl.gov/news/release.aspx?id=965&
http://www.top500.org/system/178250
http://www.nwchem-sw.org/index.php/Main_Page
Interview: Advancing Computational Chemistry with NWChem

Linda Barney is the founder and owner of Barney and Associates, a technical/marketing writing, training and web design firm in Beaverton, OR.
