Speeding up Molecular Dynamics: Modified GROMACS Code Improves Optimization, Parallelization

This is the second article in a series covering how researchers from national laboratories and scientific research centers are updating popular molecular dynamics, quantum chemistry and quantum materials code to take advantage of hardware advances such as the next-generation Intel Xeon Phi processors.

Molecular dynamics research requires use of high performance computing systems and specialized HPC software to perform highly compute-intensive calculations to generate movies of molecular motions. “In the simulation of biological and chemical molecular systems, we can never get enough computer power. More power allows us to simulate ever larger and more complex systems, more accurately, and for longer periods of time,” states Dr. Jeremy C. Smith, Governor’s Chair and Director, University of Tennessee/Oak Ridge National Laboratory, Center for Molecular Biophysics.

GROMACS is a versatile molecular dynamics software package used to simulate the motion of atoms in materials and biological systems, such as proteins. It has been the most highly utilized molecular dynamics program in recent years because it calculates the motion of atoms very quickly, in a matter of microseconds, which means calculations using GROMACS take less computer power. Researchers running simulations with GROMACS can also simulate large systems, such as many proteins interacting with each other.

GROMACS code runs on systems ranging from desktop computers up to the world’s largest supercomputers. Approximately 3,000 scientific articles per year relate to GROMACS research, and thousands of groups run the code. GROMACS is employed for true, enterprise-level science in socially important fields such as drug design and materials research. Intel processors are well suited to fast simulations, and Intel Xeon processors are often used with GROMACS because the code makes good use of system resources.

Introducing the University of Tennessee and Oak Ridge National Laboratory JICS Intel Parallel Computing Center

The Intel Parallel Computing Center at the Joint Institute for Computational Sciences (JICS), a joint institute of the University of Tennessee (UT) and Oak Ridge National Laboratory (ORNL), is a leader in optimizing HPC software codes. UT has a close working relationship with ORNL, and the two institutions share postdoctoral students and joint faculty. One of the most powerful supercomputers in the U.S. is located at ORNL and is used for key missions, such as computing and energy-related research.

In 2011, JICS established the Application Acceleration Center of Excellence (AACE) in partnership with NICS, Cray and Intel. The goals of these groups are to optimize applications for current and future computing systems, to develop expertise in the expression and exploitation of fine-grain and medium-grain parallelism, and to guide the development of future supercomputing architectures and programming models through co-design activities with HPC vendors and manufacturers, including Intel. As part of the goal to optimize applications, Dr. Jeremy C. Smith is leading a project team working to include support for the Intel Xeon Phi coprocessor architecture in the official, publicly available version of GROMACS and to carry out the requisite optimizations to ensure world-class results.

Studying plant structure reaction to bioenergy pretreatment process

The University of Tennessee / ORNL team uses GROMACS in areas such as early stage drug discovery, bioenergy and second generation cellulosic ethanol biofuel research. According to Smith, “To discover new drugs, we use GROMACS to simulate the shape changes of sites in drug target proteins to which candidate chemicals might bind, then screen millions of compounds virtually to find the strongest binders. Doing this, we have been able to successfully predict chemicals that bind to 14 different drug targets that were verified experimentally.”

The ORNL Center for Molecular Biophysics also studies ways of converting plant matter into biofuels. Combining supercomputer simulations with neutron scattering experiments, they identified fundamental effects that change plant cell-wall structures during pretreatment processes used in the production of bioenergy, the most expensive stage of biofuel production. Pretreatment subjects plant material to extremely high temperature and pressure to break apart the protective gel of lignin and hemicellulose that surrounds sugary cellulose fibers.

The team revealed unexpected findings showing that, as the biomass heats up, cellulose fibers dehydrate. This process of dehydration causes the cellulose fibers to move closer together and become more crystalline, which makes them harder to break down. The researchers also analyzed two polymers, called lignin and hemicellulose, that bond to form a tangled mesh around the cellulose bundles. The simulations showed that the two polymers separate into different phases when heated during pretreatment. Understanding the role of these underlying physical factors — dehydration and phase separation — should enable scientists to engineer improved plants and pretreatment processes and ultimately bring down the costs of biofuel production.

Supercomputers Used in the Research

The University of Tennessee / ORNL team used the Beacon and Babbage HPC systems for their Intel Xeon Phi coprocessor optimization work. The Beacon supercomputer, located at UT/NICS, is an energy-efficient cluster that utilizes Intel Xeon Phi coprocessors. It is funded by the National Science Foundation (NSF) through the Beacon project to port and optimize scientific codes to the coprocessors based on Intel’s Many Integrated Core (MIC) architecture. The Beacon HPC system contains 48 compute nodes each with two Intel Xeon processors E5-2670 and four Intel Xeon Phi coprocessors 5110P. Beacon also has the distinction of being the first HPC system to deliver 2.499 GigaFLOPS/Watt with 71.4 percent efficiency. The Babbage HPC system is located at NERSC and contains 45 compute nodes each with two Intel Xeon processors E5-2670 and two Intel Xeon Phi coprocessors 5110P.

Improving GROMACS code to use advanced hardware features

The University of Tennessee/ORNL team is at the forefront in modifying GROMACS code to improve the performance on today’s advanced hardware. Much of the work will improve parallel computing so that many calculations are carried out simultaneously without delays or communication bottlenecks. The focus of the center will be to take supercomputing to the next level to meet scientific computing demands. The porting and optimization of GROMACS code for next-generation machines is expected to catalyze advances for today’s scientists and tomorrow’s industry.

The University of Tennessee/ORNL team optimized GROMACS in a variety of ways. John Eblen, a postdoctoral researcher, states, “GROMACS provides an abstraction layer for single instruction multiple data (SIMD) vector operations, which helps immensely in adding vector operations for new architectures. So, we were able to add support for Intel Xeon Phi coprocessor SIMD vector instructions fairly quickly. By using these Intel Xeon Phi coprocessor-specific operations, speed improved 3.6 times. With the aid of Intel VTune analyzer software, we found several small, serial sections of code that were suddenly dominating the run time for large numbers of cores. These were previously considered to be quick housekeeping tasks where parallelism was not needed. For large numbers of cores, though, these tasks become a drag on performance. We added parallelism to these tasks, which led to an 86 percent improvement in run time. Algorithmic changes can also help. We changed our force reduction algorithm to a tree reduction to take advantage of cache locality. This led to a 1.9x improvement in the reduction time.”
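The idea behind a tree reduction can be illustrated with a minimal Python sketch. This is not the GROMACS implementation: the function name tree_reduce_forces and the per-thread buffers are hypothetical, and the sketch assumes each thread has accumulated forces into its own buffer. Buffers are combined pairwise in rounds, so the number of rounds grows with the logarithm of the thread count rather than linearly.

```python
from typing import List

Vector = List[float]


def tree_reduce_forces(buffers: List[Vector]) -> Vector:
    """Sum per-thread force buffers pairwise, round by round.

    Hypothetical helper for illustration only; assumes all buffers
    have the same length.
    """
    if not buffers:
        raise ValueError("need at least one buffer")
    # Work on copies so the caller's buffers are left untouched.
    work = [list(b) for b in buffers]
    stride = 1
    # In each round, buffer i + stride is added into buffer i for every i
    # that is a multiple of 2 * stride. The additions within one round are
    # independent, so a threaded code could run them in parallel.
    while stride < len(work):
        for i in range(0, len(work) - stride, 2 * stride):
            dst, src = work[i], work[i + stride]
            for j in range(len(dst)):
                dst[j] += src[j]
        stride *= 2
    return work[0]


# Five made-up per-thread buffers, two force components each.
per_thread = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0], [5.0, 4.0]]
total = tree_reduce_forces(per_thread)
print(total)
```

The sketch shows only the pairwise combining order. The cache-locality benefit the team reports depends on how buffers are laid out in memory and assigned to cores, which is not modeled here.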

Figure 2: Compute Core tasks associated with the Intel Xeon Phi coprocessor. Courtesy of University of Tennessee/ORNL

How GROMACS modifications take advantage of parallelism

GROMACS takes advantage of parallelism at three different levels on the Intel Xeon Phi coprocessor. In the example configuration shown in Figure 2, each box represents one compute core. There are 10 process-level (MPI) tasks, represented by the blue and yellow columns, which divide the atoms in the input system. Each MPI process is assigned six cores, and each core can run up to four threads (OpenMP) because of hyperthreading technology. Processes use these 24 threads to divide up small units of work, such as loop iterations. Finally, each core contains a wide single instruction multiple data (SIMD) vector unit that can process a single machine-level instruction, such as an addition, on up to 16 32-bit floating-point values at once.
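The arithmetic behind this example configuration can be checked with a short Python sketch. The class name CoprocessorLayout and its fields are hypothetical; the default values are the ones given in the description of Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoprocessorLayout:
    """Example configuration from the Figure 2 description (names are illustrative)."""
    mpi_ranks: int = 10         # process-level (MPI) tasks
    cores_per_rank: int = 6     # cores assigned to each MPI process
    threads_per_core: int = 4   # hardware threads available on each core
    simd_lanes: int = 16        # values handled by one vector instruction
    bits_per_value: int = 32    # 32-bit floating-point values

    def threads_per_rank(self) -> int:
        return self.cores_per_rank * self.threads_per_core

    def total_cores(self) -> int:
        return self.mpi_ranks * self.cores_per_rank

    def total_threads(self) -> int:
        return self.mpi_ranks * self.threads_per_rank()

    def simd_width_bits(self) -> int:
        return self.simd_lanes * self.bits_per_value


layout = CoprocessorLayout()
print(layout.threads_per_rank())  # 24 threads per MPI process
print(layout.total_cores())       # 60 cores in use
print(layout.total_threads())     # 240 threads in total
print(layout.simd_width_bits())   # 512-bit vector unit
```

In this configuration the three levels multiply: 10 processes, 24 threads per process, and 16 values per vector instruction on each core.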

Figure 3: How various GROMACS operations scale. Courtesy of University of Tennessee/ORNL

Eblen indicates, “Having early access to new technology helps tremendously in improving GROMACS for future platforms. Specifically, we can discover and fix problems where the code no longer scales for high numbers of threads. As one example, Figure 3 shows an early test run to see how different parts of GROMACS scale. The figure displays the sum of run times per thread for increasing thread counts for various GROMACS operations. Ideally, the sum of run times per thread will remain constant, showing no overhead from dividing the operation among multiple threads. The listed forces, though, do not scale past 60 threads. Below 60 or so processes, the scaling looks fine, but now we were able to see clearly that something was amiss with the listed forces (green area). On further investigation, we uncovered and fixed a bug that was preventing that computation from scaling. We then could confirm that, with the fix, it scales well to over 200 threads.”
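The metric Eblen describes, the sum of run times over all threads, can be sketched in a few lines of Python. The timings below are invented for illustration and are not the values plotted in Figure 3; the function names are hypothetical, and the sketch assumes every thread is busy for the full wall-clock time of the operation.

```python
from typing import Dict, Optional


def summed_thread_time(wall_seconds: Dict[int, float]) -> Dict[int, float]:
    """Map thread count to total thread-seconds (thread count * wall time)."""
    return {n: n * t for n, t in wall_seconds.items()}


def first_scaling_break(wall_seconds: Dict[int, float],
                        tolerance: float = 1.25) -> Optional[int]:
    """Return the smallest thread count whose summed time exceeds the
    summed time at the lowest thread count by more than `tolerance` times,
    or None if the operation keeps scaling."""
    summed = summed_thread_time(wall_seconds)
    counts = sorted(summed)
    baseline = summed[counts[0]]
    for n in counts[1:]:
        if summed[n] > tolerance * baseline:
            return n
    return None


# Made-up wall-clock times (seconds) for one operation at each thread count.
stalls_after_60 = {15: 8.0, 30: 4.0, 60: 2.0, 120: 2.0, 240: 2.0}
scales_well = {15: 8.0, 30: 4.0, 60: 2.0, 120: 1.0, 240: 0.5}

print(summed_thread_time(stalls_after_60))
print(first_scaling_break(stalls_after_60))
print(first_scaling_break(scales_well))
```

When an operation scales ideally, doubling the threads halves the wall time and the summed value stays flat. When the wall time stops improving, the summed value grows with the thread count, which is the kind of pattern that made the listed-forces problem visible.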

Future molecular dynamics research requires HPC hardware and software enhancements

Code enhancements that the University of Tennessee/ORNL PCC team makes to optimize GROMACS code to support Intel Xeon Phi coprocessors are reviewed by a GROMACS committee of core developers. Code changes that are approved are loaded into the main GROMACS code branch for release in the next version.

The University of Tennessee/ORNL team believes that more improvements need to be made to HPC systems to enable future exascale research. For example, “Transparency is key for improving developer productivity. Developers would like to avoid optimizing code by slow, tedious trial and error. For example, Intel provides prefetching instructions for optimizing code, which allow the fetching of data prior to its use. However, optimizing code with prefetching is difficult, because we cannot see what is happening inside the cache. If we prefetch too early, values are evicted before they are used. If we prefetch too late, data will not arrive in time. Prefetching can also inadvertently hurt performance by evicting other needed data. So, optimizing with prefetching is a lot of trial and error. Intel VTune software offers unprecedented detail, such as information about cache misses and run times for individual assembly instructions, but these are only clues, not direct evidence. In an ideal world, developers could print a diagram of variables stored in the cache and variables pending arrival after every cycle inside a time critical loop. Then we could see exactly what is happening. Implementing this would be challenging for several reasons, including figuring out how to handle out-of-order instructions and how to inspect the cache without impacting its operation too much, but this type of detailed transparency is a worthwhile goal,” states Eblen.
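The timing trade-off Eblen describes can be made concrete with a toy Python model. It is a deliberately simplified sketch, not a model of any real Intel cache: it assumes a loop that uses one element per step, a fixed memory latency measured in steps, and a small first-in, first-out cache. The function name simulate_prefetch and all parameter values are hypothetical.

```python
from collections import Counter, deque


def simulate_prefetch(n_elements: int, distance: int,
                      latency: int, capacity: int) -> Counter:
    """Count what happens to each element when the loop prefetches
    `distance` iterations ahead (toy model, illustrative only)."""
    in_flight = {}                  # element index -> step at which it arrives
    cache = deque(maxlen=capacity)  # FIFO: a full deque drops its oldest line
    outcomes = Counter()
    for step in range(n_elements):
        # Lines whose latency has elapsed land in the cache.
        for element in sorted(e for e, t in in_flight.items() if t <= step):
            cache.append(element)
            del in_flight[element]
        # The loop body uses element `step` now.
        if step in cache:
            outcomes["hit"] += 1       # arrived in time and still cached
        elif step in in_flight:
            outcomes["late"] += 1      # prefetch issued too late
        elif step < distance:
            outcomes["cold"] += 1      # never prefetched (start of loop)
        else:
            outcomes["evicted"] += 1   # arrived, then pushed out before use
        # Issue a prefetch for the element `distance` iterations ahead.
        target = step + distance
        if target < n_elements:
            in_flight[target] = step + latency
    return outcomes


for distance in (4, 12, 30):
    result = simulate_prefetch(100, distance, latency=10, capacity=8)
    print(distance, dict(result))
```

With these made-up parameters, a distance shorter than the latency makes nearly every access late, a distance slightly longer than the latency gives mostly hits, and a much longer distance causes most lines to be evicted before use. Real hardware adds out-of-order execution, multiple cache levels and competing data, which is why Eblen describes tuning in practice as trial and error.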

Smith states, “In the future, we will want to simulate ever more complex molecular systems in greater and greater detail. Eventually, we will need to simulate a whole living cell at atomic detail. This will require clever methods and algorithms to partition and coarse grain time and space on multiple length and timescales, to make sure all necessary but no superfluous information is used in any sub-computation, to build adapted hardware and to assign tasks to it in the most efficient manner.”

Linda Barney is the founder and owner of Barney and Associates, a technical/marketing writing, training and web design firm in Beaverton, OR.
