This is the fourth installment in the series covering how researchers from national laboratories and scientific research centers are updating popular molecular dynamics, quantum chemistry and quantum materials code to take advantage of hardware advances, such as the next-generation Intel Xeon Phi processors.
Molecular dynamics software, used to simulate the evolution of a system of atoms or other particles, enables researchers to determine the thermodynamic, kinetic and transport properties of various materials. This research tool is used for a wide variety of applications that include developing new and better therapeutics, designing new materials with better properties and improving the efficiency of molecular devices. Molecular dynamics research requires high performance computing (HPC) systems and specialized HPC software to perform the highly compute-intensive mathematical calculations and to generate images of the molecular structures.
The DL_POLY molecular simulation software package is a widely used classical molecular dynamics (MD) simulation application developed at the Science & Technology Facilities Council (STFC) Daresbury Laboratory. The package is used to model the atomistic evolution of the full spectrum of models commonly employed in the materials science, solid-state chemistry, biological simulation and soft condensed-matter communities. DL_POLY 4 can be executed as a serial or parallel application. The code achieves parallelization that is suitable for homogeneous, distributed-memory, parallel computers.
ICHEC molecular dynamics research
The Irish Centre for High-End Computing (ICHEC) is the national HPC Centre for Ireland, with offices located both in Dublin and Galway, and is hosted by the National University of Ireland, Galway. ICHEC operates the national HPC service providing compute resources and software expertise for the research communities across all the main science disciplines through collaborative partnerships and programs of education and outreach. ICHEC is recognized as an Intel Parallel Computing Center (PCC) and focuses on optimizing DL_POLY and other software codes to be able to take advantage of hardware features in the Intel Xeon processor and Intel Xeon Phi processor-based systems.
According to Dr. Michael Lysaght, Head of Novel Technologies, ICHEC, “DL_POLY is used to study a wide range of molecular dynamics from solids to liquids to biomolecules, as shown in Figures 1 and 2. For our code modernization work on DL_POLY 4, we have focused on two use cases of interest to the materials science community, namely an iron simulation at 300 K using an NPT Berendsen ensemble with Finnis-Sinclair forces and no electrostatics (250,000 atoms) and 16 Gramicidin A molecules in aqueous solution at 300 K using an NPT Berendsen ensemble with SPME and the SHAKE/RATTLE algorithm for the constrained motion (792,960 atoms).”
Modifying legacy code for modern hardware
Many of the legacy codes used in molecular dynamics research have been in existence for a long time, and the code is not designed to take advantage of features of new hardware, such as the Intel Xeon Phi coprocessor. As a national HPC center focused on supporting a wide range of communities and a huge number of applications, the ICHEC team finds that one of the main benefits of working with Intel Xeon processors and Intel Xeon Phi coprocessors is that they use common languages, models, and familiar and standard development tools, so there’s no need to learn new languages or tools, and performance-focused improvements are fully portable.
“ICHEC believes strongly in the huge impact that the Intel Xeon Phi processor can have on accelerating scientific discovery, particularly in the area of chemistry. Recognizing that the massive performance of the Intel Xeon Phi processor can only be unleashed through significant code modernization, we have focused heavily on preparing several codes across a wide range of domains for current and future generations of the platform. Much of our research and development work is driven by both national and international collaborative projects, including the European PRACE-RI, EU H2020 projects and ICHEC’s Industry Services Program. Working closely with Intel throughout, one of ICHEC’s aims is to continue to help our users, partners and clients to remain competitive through the use of current and future generations of Intel Xeon Phi processors,” states Lysaght. The progress the ICHEC team makes in optimizing DL_POLY code will be rolled back into the DL_POLY 4 version of the code so researchers can use the changes without needing to manually change their legacy code.
All the ICHEC development and testing on DL_POLY is carried out on the hybrid component of ICHEC’s Fionn cluster, a heterogeneous machine made up of four components: Thin, Hybrid, Fat and Service. The Thin component is an SGI ICE X system of 320 nodes or 7680 cores made of 2.4GHz Intel Xeon processor cores (formerly known as Ivy Bridge). Each node has 2×12 core processors, 64GB of RAM and is connected using FDR InfiniBand. This amounts to 20TB of RAM across the partition. The Hybrid partition contains 32 nodes, where 16 of these nodes have 2×10 core 2.2GHz Intel Xeon processors E5-2660 v2 with 64GB of RAM connected to two Intel Xeon Phi coprocessor 5110P cards.
DL_POLY code modifications
The early-stage efforts at ICHEC have been to enable and optimize DL_POLY 4 to run efficiently on Intel Xeon Phi processor-based systems. Efforts have focused on designing and implementing an efficient hybrid MPI/OpenMP version of the DL_POLY 4 code for the first time.
DL_POLY is a highly scalable pure MPI-based code with over 500,000 lines of code. The ICHEC team has evaluated this hybrid version of the code in native, offload and MPI symmetric modes, with a particular focus on the top three most time-consuming algorithms of the code for several real-world test cases that are of interest to the materials science community. These components include the calculation of ‘two body forces (TBF)’, ‘link cell pairs’ and ‘constraints’, all of which are found in other well-known molecular dynamics software packages. “Because these are general methods in this domain, we feel the methods we have applied as part of our focus on DL_POLY 4 will also have relevance to other codes in the community,” indicates Lysaght.
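To make the 'link cell pairs' component more concrete, the following is a minimal Python sketch of the general link-cell idea used across molecular dynamics codes: atoms are binned into cells whose edge equals the interaction cutoff, so only atoms in the same or adjacent cells need to be tested. It is an illustration only and is not taken from DL_POLY 4. The function names are hypothetical, and the sketch assumes a non-periodic box.

```python
from collections import defaultdict
from itertools import product


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def brute_force_pairs(positions, r_cut):
    """Reference method: test every pair of atoms, O(N^2)."""
    rc2 = r_cut * r_cut
    n = len(positions)
    return {(i, j)
            for i in range(n) for j in range(i + 1, n)
            if squared_distance(positions[i], positions[j]) <= rc2}


def link_cell_pairs(positions, r_cut):
    """Find interacting pairs by searching only neighbouring cells."""
    # Bin each atom into a cell whose edge length equals the cutoff.
    cells = defaultdict(list)
    for index, point in enumerate(positions):
        cells[tuple(int(c // r_cut) for c in point)].append(index)

    rc2 = r_cut * r_cut
    dim = len(positions[0])
    pairs = set()
    for cell, members in cells.items():
        # Visit the cell itself and every adjacent cell.
        for shift in product((-1, 0, 1), repeat=dim):
            neighbour = tuple(c + s for c, s in zip(cell, shift))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j and squared_distance(positions[i], positions[j]) <= rc2:
                        pairs.add((i, j))
    return pairs
```

Both functions return the same set of pairs; the cell-based version simply avoids testing atoms that are too far apart to interact.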
Modifications for calculating two body forces
The Two Body Forces (TBF) module is one of the most compute-intensive components of DL_POLY 4, typically taking up more than 30 percent of wall clock time, although this varies with problem type. The ICHEC team focused most of their optimization effort on this module. The team implemented an OpenMP parallelization of the TBF stage targeting both Intel Xeon processors and Intel Xeon Phi coprocessors in one code base. The initial OpenMP parallelization scheme relied on OpenMP reduction clauses, which were found to significantly impede scaling over OpenMP threads on the Intel Xeon Phi coprocessor. The team then implemented an alternative design of the TBF component that avoids OpenMP reductions over arrays and improves performance on the Intel Xeon Phi coprocessor by a factor of more than four relative to the out-of-the-box OpenMP implementation using the reduction clause.
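The article does not describe the alternative design in detail, so the sketch below shows only one common way to avoid array reductions, which may differ from what ICHEC implemented. In the first function, each worker accumulates into a private copy of the force array and the copies are summed afterwards, which mimics an array reduction. In the second, each atom's force is computed entirely by its owner, so no two workers write to the same element. The workers are emulated serially in Python, the 1D Lennard-Jones force and all function names are assumptions chosen for illustration.

```python
def lj_force(dx, eps=1.0, sigma=1.0):
    """1D Lennard-Jones force on atom i due to atom j, with dx = x_i - x_j."""
    r2 = dx * dx
    s6 = (sigma * sigma / r2) ** 3
    return 24.0 * eps * (2.0 * s6 * s6 - s6) * dx / r2


def forces_with_reduction(x, n_workers):
    """Half pair loop: each pair is computed once and written to both atoms.

    Workers would collide on force[j], so each one gets a private array
    and the arrays are summed at the end (the array reduction).
    """
    n = len(x)
    private = [[0.0] * n for _ in range(n_workers)]
    for w in range(n_workers):              # emulates one thread per worker
        for i in range(w, n, n_workers):
            for j in range(i + 1, n):
                f = lj_force(x[i] - x[j])
                private[w][i] += f
                private[w][j] -= f          # Newton's third law
    return [sum(p[k] for p in private) for k in range(n)]


def forces_owner_computes(x):
    """Full pair loop: atom i accumulates only its own force.

    Every pair is evaluated twice, but iterations over i are independent,
    so no private copies and no reduction are needed.
    """
    n = len(x)
    force = [0.0] * n
    for i in range(n):                      # each i could run on its own thread
        acc = 0.0
        for j in range(n):
            if j != i:
                acc += lj_force(x[i] - x[j])
        force[i] = acc
    return force
```

The second form does twice the arithmetic, which is the usual trade-off: extra floating-point work in exchange for removing the memory traffic and synchronization of the reduction.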
Modifications that ICHEC is making to TBF stem mainly from employing an alternative ‘direct’ means of calculating the two body forces, making much better use of the vector processing units (VPUs) on the Intel Xeon Phi coprocessor, such that the throughput of two body force calculations (TBFs/s) has improved significantly. The ICHEC TBF work focused on testing the Two Body Forces time-to-solution on an Intel Xeon processor (one and two sockets) and Intel Xeon Phi coprocessors 5110P and 7120P.
In Figure 3, the light shading indicates that a tabulated method was used for the two body forces simulations, while the light green shading indicates a direct evaluation of the molecular potential. For the Intel Xeon processor system, ICHEC used 10 MPI processes for one socket and 20 MPI processes for the two-socket case. On the Intel Xeon Phi coprocessor, the best performance is achieved by using 30 MPI processes, each spawning enough threads to fully subscribe two cores. Performance on the coprocessor improves as extra threads are placed on each core, with four threads per core giving the best result (shown by the orange bars).
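The process and thread counts above follow from simple arithmetic, sketched here in Python. The figure of 60 cores for the Intel Xeon Phi coprocessor 5110P is not stated in the article and is supplied here as background; the function name and the returned keys are hypothetical.

```python
def hybrid_layout(total_cores, mpi_ranks, threads_per_core):
    """Work out how many OpenMP threads each MPI rank needs to fill its cores."""
    if total_cores % mpi_ranks != 0:
        raise ValueError("cores must divide evenly among MPI ranks")
    cores_per_rank = total_cores // mpi_ranks
    threads_per_rank = cores_per_rank * threads_per_core
    return {
        "cores_per_rank": cores_per_rank,
        "threads_per_rank": threads_per_rank,
        "total_threads": threads_per_rank * mpi_ranks,
    }


# Assumed 60-core coprocessor, 30 MPI processes, four threads per core.
layout = hybrid_layout(total_cores=60, mpi_ranks=30, threads_per_core=4)
print(layout)
```

Under these assumptions each MPI process covers two cores with eight OpenMP threads, for 240 threads across the card.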
As can be seen in Figure 3, calculating the two body forces directly rather than with the tabulated method, which is the approach most typically used in DL_POLY 4, improves the TBF time-to-solution by more than 23 percent on the Intel Xeon Phi coprocessor 5110P and by 12 percent on the Intel Xeon processor.
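The difference between the two approaches can be sketched in a few lines of Python. The tabulated method precomputes the potential on a grid and interpolates at run time, which costs an indexed memory lookup per pair; the direct method evaluates the formula arithmetically, which tends to suit wide vector units better. The Lennard-Jones potential, the linear interpolation, the grid size and all names below are assumptions for illustration and do not reproduce the DL_POLY 4 implementation.

```python
def lj_energy(r, eps=1.0, sigma=1.0):
    """Direct evaluation: pure arithmetic, no memory lookups."""
    s6 = (sigma / r) ** 6
    return 4.0 * eps * (s6 * s6 - s6)


def build_table(func, r_min, r_cut, n_points):
    """Precompute func on a uniform grid between r_min and r_cut."""
    dr = (r_cut - r_min) / (n_points - 1)
    return [func(r_min + k * dr) for k in range(n_points)], dr


def tabulated_energy(table, dr, r_min, r):
    """Tabulated evaluation: look up two grid points and interpolate linearly."""
    t = (r - r_min) / dr
    if t < 0.0 or t > len(table) - 1:
        raise ValueError("r is outside the tabulated range")
    k = min(int(t), len(table) - 2)   # clamp so that k + 1 stays in range
    frac = t - k
    return table[k] * (1.0 - frac) + table[k + 1] * frac


# Hypothetical grid: 2000 points between r = 0.8 and the cutoff r = 3.0.
table, dr = build_table(lj_energy, 0.8, 3.0, 2000)
for r in (1.0, 1.5, 2.5):
    print(r, lj_energy(r), tabulated_energy(table, dr, 0.8, r))
```

The two methods agree closely on this grid; the choice between them is about how the hardware handles lookups versus arithmetic, not about accuracy.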
As an overall result of the ICHEC work focused on TBF, the throughput of the two body forces calculations on a single Intel Xeon Phi coprocessor 7120P (in offload mode) is now approaching the TBF/s throughput achieved when running on two Intel Xeon processors E5-2660 v2.
How HPC will aid future molecular dynamics research
The work that ICHEC has done to optimize DL_POLY code is available to other researchers to help save time in manually changing code. In collaboration with STFC, who are the main developers of DL_POLY, a version of the DL_POLY code that contains the ICHEC changes can be accessed by downloading a branch version of the code that is available on CCPForge.
However, there are areas where ICHEC believes that improvements need to be made to future HPC hardware and software to enable future exascale research. The ICHEC team sees issues across various projects that are difficult to address with the current HPC systems and codes. For example, many of the chemistry codes have workloads that are typically memory bound and are composed of kernels that are often challenging to vectorize efficiently due to heavy branching and indirect array accesses. In addition, communication-bound methods, such as distributed fast Fourier transforms (FFTs), will continue to place demands on internode communication at extreme scale.
“With these challenges in mind, we are happy to see the standard CPU form factor of the next-generation Intel Xeon Phi processor, the high bandwidth on package memory, as well as the support for Intel Omni-Path connectors, that will be part of the second-generation Intel Xeon Phi processor platforms. Continual improvements to the auto-vectorization capability of the Intel compiler will also be welcome with the improved analytical capability of Intel profilers, such as Intel VTune and the Intel Vector Advisor Tool, to aid with deeper performance insights.”
“In terms of national research priority areas, we are specifically interested in preparing DL_POLY 4 for the deep petascale/future exascale era and as a state-of-the-art tool for advanced materials researchers. Ireland has a particular interest in this area due to the presence of leading materials science centers, such as the Tyndall National Institute, AMBER, as well as being the headquarters for the E-CAM EU Horizon 2020 HPC Applications Centre of Excellence (of which ICHEC is a member). In close support of researchers in Ireland, our ICHEC team has focused on enabling DL_POLY 4 on Intel Xeon Phi processor-based systems to provide insight in the area of sustainable energy production, which is a grand challenge for society and an integral part of Ireland’s sustainable energy objectives and climate change strategy and where such energy production will require novel materials engineered with atomistic precision,” states Lysaght.
Other articles in this series covering the modernization of popular chemistry codes include:
- Modified NWChem Code Utilizes Supercomputer Parallelization
- Speeding up Molecular Dynamics: Modified GROMACS Code Improves Optimization, Parallelization
- Optimized, Parallelized Software Enhances Quantum Chemistry Research
- CCPForge DL_POLY page: https://ccpforge.cse.rl.ac.uk/gf/project/dl-poly
Linda Barney is the founder and owner of Barney and Associates, a technical/marketing writing, training and web design firm in Beaverton, OR.