This is the fifth installment in a series covering how scientists are updating popular molecular dynamics, quantum chemistry and quantum materials codes to take advantage of hardware advances, such as the forthcoming Intel Xeon Phi processors.
Quantum-mechanical materials and molecular modeling is the science of modeling materials at the nanoscale. Quantum materials research describes the structure and interactions of matter using the mathematics of quantum mechanics, and it has a wide range of applications, from material assemblies and small chemical systems to biological molecules. Complex quantum materials research requires high performance computing (HPC) systems because of the volume of data involved and the computational power needed to solve the underlying equations and generate images.
Researchers use specialized software such as Quantum ESPRESSO, along with a variety of HPC software, in conducting quantum materials research. Quantum ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling, based on density-functional theory, plane waves and pseudopotentials. Quantum ESPRESSO is coordinated by the Quantum ESPRESSO Foundation and has a growing worldwide user community in academic and industrial research. Its intensive use of dense mathematical routines makes it an ideal candidate for many-core architectures, such as the Intel Xeon Phi coprocessor.
The Intel Parallel Computing Centers at Cineca and Lawrence Berkeley National Laboratory (LBNL), along with the National Energy Research Scientific Computing Center (NERSC), are at the forefront of using HPC software and modifying Quantum ESPRESSO (QE) code to take advantage of the Intel Xeon processors and Intel Xeon Phi coprocessors used in quantum materials research. In addition to Quantum ESPRESSO, the teams use tools such as Intel compilers and libraries, Intel VTune and OpenMP in their work. The goal is to incorporate their changes into the public version of the code, so that scientists benefit from the improved optimization and parallelization without having to manually modify legacy code themselves.
Electrical conductivity of a PDI-FCN2 molecule
One example of how Cineca used Quantum ESPRESSO to study a real device of promising scientific and technological interest is the electrical conductivity of a PDI-FCN2 molecule. The study was conducted by Cineca in collaboration with the University of Bologna and the National Research Council of Italy – Institute for Nanoscience (CNR-NANO). The subject of the study is a two-terminal device based on PDI-FCN2, a molecule derived from perylene. This system is important for studying electron transport in single-molecule devices and for the further development of a new generation of organic field-effect transistors (OFETs). The simulated system is composed of two gold electrodes, each made of 162 gold atoms, with a PDI-FCN2 molecule between them; in total, it contains 390 atoms and 3,852 electrons. The metallic nature of the leads also requires a fine sampling of the Brillouin zone in the description of the electronic structure, which further increases the computational effort required to simulate the system. Figures 1 and 2 show the molecular structure and the results of the study.
The quantum-mechanical solution of the electronic problem for such a large system is a significant challenge: it requires a large HPC infrastructure, like the one available at Cineca, and the full scaling capabilities of Quantum ESPRESSO.
Dr. Carlo Cavazzoni (Principal Investigator at the Cineca Intel Parallel Computing Center) states, “Based on the results obtained by this study, we will gain a deep understanding in the intimate conduction mechanisms of this type of organic devices, going a step forward in the direction of utilizing the new OFET technologies that soon will replace the traditional silicon devices. Quantum ESPRESSO and new supercomputing facilities will make possible our studies and the understanding of the physics of the devices that, in the future, will be the building blocks for new photovoltaics cells, next-generation displays and molecular computers.”
Cineca supercomputer
The Cineca team currently does its research on the Cineca FERMI Blue Gene/Q supercomputer and on Galileo, an IBM NeXtScale cluster based on Intel Xeon processors and 768 Intel Xeon Phi 7120P coprocessors. Cineca's next HPC system, a Lenovo machine named Marconi, will deliver 2 PFlops with Intel Xeon E5 v4 processors in a first stage and 11 PFlops with next-generation Intel Xeon Phi processors before fall of 2016. A third stage will add 4.5 PFlops with future Intel Xeon processors and integrate an Intel Omni-Path interconnect.
Cineca Quantum ESPRESSO code modifications
Cineca is engaged in many R&D projects relating to HPC development. One of the most important is MaX, a center of excellence funded by the European Community whose ambition is to establish an infrastructure for material scientists and to support the development of codes toward the exascale. According to Cavazzoni, “We have always focused our work on the numerical algorithms, in particular the fast Fourier transform (FFT) and the linear algebra modules. We started to rethink the overall organization of the memory hierarchy and parallelization structure. In particular, we modified the code to implement a sort of hierarchical tiling of data structures. To do so, we had to deeply modify the distribution of the data structures in Quantum ESPRESSO. The following figure shows the high-level QE hierarchy.”
Cineca is tiling data structures to use the computing power of each node efficiently. The team changed the fine-grain parallelism in the QE FFT module by refactoring the data distribution using task groups, as shown in Figure 4. The move to a many-core model required changing the QE code to fit the structure of a single node efficiently and splitting it into intra-node and inter-node processes. In the new data layout, a single tile of processes inside a given task group contains all the G-vectors and a subset of bands needed to compute full 3-D FFTs. The data tiling can be changed to best match the characteristics of the HPC system, up to the limit (if node memory permits) of having a whole 3-D FFT performed by a single task group locally on the node.
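The tiling idea can be pictured with a toy sketch. This is not QE code; the function and field names below are hypothetical, chosen only to illustrate how processes might be grouped into task groups, each owning a contiguous slice of bands so that the group can run whole 3-D FFTs locally.

```python
# Illustrative sketch of task-group tiling (hypothetical names, not QE code):
# partition n_procs processes into task groups and give each group a
# contiguous slice of bands, so each group can run full 3-D FFTs locally.
def tile_bands(n_bands, n_procs, taskgroup_size):
    n_groups = n_procs // taskgroup_size
    per_group = -(-n_bands // n_groups)  # ceiling division
    tiles = []
    for g in range(n_groups):
        lo = g * per_group
        hi = min(lo + per_group, n_bands)
        procs = list(range(g * taskgroup_size, (g + 1) * taskgroup_size))
        tiles.append({"group": g, "bands": list(range(lo, hi)), "procs": procs})
    return tiles

# 8 bands over 8 processes in task groups of 4: two groups of 4 bands each
for tile in tile_bands(8, 8, 4):
    print(tile)
```

Varying `taskgroup_size` mirrors the trade-off described above: a larger group keeps more of each FFT local to a node, at the cost of more memory per node.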
The following example shows the results of a Car-Parrinello simulation of a system of 32 water molecules. The plot shows the differences between the old implementation (blue) and the new one (red), highlighting a reduction in time-to-solution, with several task-group distributions shown. The simulations were run on an Intel Xeon processor E5-2630 v3.
Cavazzoni indicates, “The Intel exascale roadmap allows for a smooth innovation path in the code and a constant improvement of performance and scalability. The availability of a large number of cores per node has made it possible to tune the different layers of parallelization. A good tiling of the different data structures permits us to efficiently tile the memory and computing power of each node, reducing the amount of communication and, thus, improving performance. We changed the fine-grain parallelism of QE and, in particular, the FFT module. Adopting different kinds of data distribution (task groups), we achieved a good improvement in performance (Figure 5). However, there is still room for improvement, in particular in the efficiency of the OpenMP multithreading, which is now limited to 4-8 threads. This is because workloads that are too small can induce load imbalance and, in turn, long spin times. By adopting OpenMP tasking strategies, we expect a considerable improvement in shared-memory parallelism based on the new task-level parallelism implemented in OpenMP 4. We have already done some tests that make us think we can remove the bottleneck displayed by synchronous thread-level parallelism.”
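The load-imbalance problem Cavazzoni describes is easy to see in a small model. The sketch below is illustrative only, not QE code: with a static, synchronous split, the heaviest chunk sets the pace, while a task pool (roughly what OpenMP tasking provides) lets lightly loaded workers absorb remaining work.

```python
# Toy model of static chunking vs. task-based scheduling (illustrative only).
def makespan_static(costs, n_workers):
    # synchronous split: each worker gets one contiguous chunk;
    # total time is the heaviest chunk
    chunk = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))

def makespan_tasked(costs, n_workers):
    # task pool: each pending task goes to the least-loaded worker,
    # approximating what dynamic tasking achieves at runtime
    loads = [0] * n_workers
    for c in sorted(costs, reverse=True):
        loads[loads.index(min(loads))] += c
    return max(loads)

costs = [8, 1, 1, 1, 1, 1, 1, 8]   # uneven per-task work
print(makespan_static(costs, 4))   # 9: lightly loaded workers sit idle
print(makespan_tasked(costs, 4))   # 8: work evens out across workers
```

With threads that merely split loops synchronously, the idle time shows up as the spinning Cavazzoni mentions; task-level scheduling trades that idle time for useful work.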
LBNL/NERSC Quantum ESPRESSO code modifications
The main focus of Lawrence Berkeley National Laboratory (LBNL), working with the National Energy Research Scientific Computing Center (NERSC), is to advance open-source quantum chemistry and materials codes on multicore high-performance computing systems. They are jointly optimizing a variety of codes, including NWChem and Quantum ESPRESSO. NERSC is a national supercomputing center that serves the supercomputing mission and data needs of the U.S. Department of Energy Office of Science; it is part of Lawrence Berkeley National Laboratory, adjacent to the University of California campus. NERSC is also modifying Quantum ESPRESSO code, since it is one of the most commonly used codes on NERSC systems.
According to Taylor Barnes, LBNL Hopper Fellow, “In particular, we are interested in improving the performance of hybrid Density Functional Theory (DFT) calculations within Quantum ESPRESSO. Hybrid DFT is often more accurate than other types of DFT, and can be especially important for performing simulations of systems like batteries and photovoltaic cells. Unfortunately, hybrid DFT is also much more computationally demanding and, thus, many of the calculations that we would like to perform are difficult or impossible to run on current machines.”
One of the LBNL/NERSC strategies for improving the performance of hybrid calculations in Quantum ESPRESSO has been to refactor and modify the hybrid sections of the code. Barnes states, “In doing so, we have made significant changes to both the communication and parallelization strategies, leading to large improvements in the code’s strong scaling efficiency.”
Another focus of the LBNL/NERSC efforts is investigating improved ways to parallelize the fast Fourier transforms (FFTs), which are an integral part of any calculation in Quantum ESPRESSO. “FFTs are notoriously difficult to parallelize efficiently across nodes; as a result, we are exploring strategies for distinguishing between intra-node parallelization of the FFTs using OpenMP and inter-node parallelization of other portions of the calculation using MPI. Our expectation is that these changes will be especially important on Intel Xeon Phi architectures,” indicates Barnes.
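The two-level split Barnes describes can be sketched as simple bookkeeping. The function below is hypothetical, not the QE implementation: the z-planes of an FFT grid are first divided across nodes (where MPI communication would occur), and each node's planes are then divided again across its OpenMP threads.

```python
# Sketch of a two-level FFT-plane decomposition (illustrative only):
# inter-node split (MPI) over z-planes, intra-node split (OpenMP threads).
def plan_fft_planes(n_planes, n_nodes, threads_per_node):
    plan = {}
    per_node = -(-n_planes // n_nodes)  # planes per node (ceiling)
    for node in range(n_nodes):
        node_lo = node * per_node
        node_hi = min(node_lo + per_node, n_planes)
        per_thread = -(-(node_hi - node_lo) // threads_per_node)
        for t in range(threads_per_node):
            lo = node_lo + t * per_thread
            hi = min(lo + per_thread, node_hi)
            plan[(node, t)] = list(range(lo, hi))
    return plan

# 16 z-planes on 2 nodes with 4 threads each: threads only touch planes
# inside their own node, so cross-node data exchange is confined to MPI.
plan = plan_fft_planes(16, 2, 4)
print(plan[(0, 0)], plan[(1, 3)])   # [0, 1] [14, 15]
```

Because each thread's planes stay inside its node, the shared-memory level needs no message passing at all, which is the point of separating the two layers.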
How HPC will aid quantum materials research in the future
Cineca, LBNL and NERSC all have a vision of how improved HPC code and Intel processors and coprocessors can improve the future of quantum materials research. The work these groups are doing to modify code to take advantage of HPC parallelization and optimization is especially important because there are not enough software engineers to adapt legacy codes. The work they are doing is being reviewed, and the optimization and parallelization modifications made by Cineca have been approved and incorporated into Release 5.3.0 of the Quantum ESPRESSO code.
LBNL/NERSC work toward future exascale HPC systems
Both the LBNL and NERSC teams are active in the Intel Xeon Phi Users Group (IXPUG) and in the exchange of information and ideas to enhance the usability and efficiency of scientific applications running on large Intel Xeon Phi coprocessor-based HPC systems.
NERSC will be getting a large next-generation Intel Xeon Phi-processor-based supercomputer known as Cori late in 2016. NERSC has launched the NERSC Exascale Science Applications Program, which will allow 20 projects to collaborate with NERSC, Cray and Intel by providing access to early hardware, special training and preparation sessions. Project teams, guided by NERSC, Cray and Intel, will undertake intensive efforts to adapt software to take advantage of Cori’s manycore architecture and to use the resultant codes to produce path-breaking science on an architecture that may represent an approach to exascale systems.
Cineca work toward future exascale HPC systems
Cavazzoni states, “In the context of the MaX project, we are committed to working on different codes from the material science community in order to get ready for the exascale challenges. One of our main targets is to contribute to the modularization of such codes, building domain-specific libraries that can be used in different codes and/or complex workflows like LEGO blocks. This high degree of modularization will also allow our team to increase performance and the suitability for new incoming architectures. In QE, we are already performing this work, and we recently packed all the functionality related to the FFT kernels into a specific library. We are doing similar work for the linear algebra kernels (such as diagonalization and eigenvalue problems). Together with MaX, we are also exploring new parallel paradigms and their possible use in QE. In particular, we are interested in the tasking strategies implemented in the OpenMP standard.
“The advent of the Intel Xeon Phi architecture gave us a strong motivation to increase the level of exposed parallelism in QE. Working on this aspect brings us much closer to exascale scalability. The Intel Xeon Phi architecture clearly tells us that what will make the difference is the ability to make the best use of the shared-memory paradigm and of node resources. We need to allow the allocation of a single MPI task per socket; today the best MPI-to-threads ratio is 1/2 or 1/4, rarely 1/8, and nothing above. We should improve shared-memory efficiency to allow MPI-to-threads ratios on the order of 1 to 32 at least, and this will be valuable for any architecture, not only the Intel Xeon Phi processor. All these enhancements will soon be tested on the upcoming Intel Xeon Phi processors that will be available this year in new supercomputers.”
Other articles in this series covering the modernization of popular chemistry codes include:
- Modified NWChem Code Utilizes Supercomputer Parallelization
- Speeding up Molecular Dynamics: Modified GROMACS Code Improves Optimization, Parallelization
- Optimized, Parallelized Software Enhances Quantum Chemistry Research
- DL_POLY Molecular Dynamics Code Changes Improve Optimization and Parallelization
References
- http://www.hpc.cineca.it
- http://www.quantumespresso.org
- http://www.max-center.eu
- https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor
- https://software.intel.com/en-us/articles/intel-parallel-computing-center-at-lawrence-berkeley-national-laboratory
- http://www.nersc.gov
- http://www.nersc.gov/users/computational-systems/cori/nesap
Linda Barney is the founder and owner of Barney and Associates, a technical/marketing writing, training and web design firm in Beaverton, OR.