Cool NERSC Experiment Really Pays Off in Energy Savings
Custom cooling distribution unit built on commodity hardware delivers energy and space savings
The U.S. Department of Energy’s National Energy Research Scientific Computing Center (NERSC) is one of the largest facilities in the world devoted to providing computing resources and expertise for basic science research, serving nearly 4,000 researchers from around the globe. To facilitate this research, the center houses a range of HPC systems, including a new 1,120-node system that serves as a combined high performance computing cluster and a scientific cloud computing testbed. The system was installed last year to replace two existing clusters and to support an American Recovery and Reinvestment Act project, called Magellan, that explores whether a cloud computing model could meet the needs of scientists.
Because our primary goal at NERSC is to maximize the scientific output of users, we consistently try to limit system downtimes that disrupt their workflow. So, when it was time to install the system’s high-density servers, our implementation team resolved to deploy an “orderly switchover,” meaning that the new systems would have to be up and running before the existing clusters were decommissioned.
This strategy created some significant challenges for our implementation team, as we had to cost-effectively install a system that would push the boundaries of energy and space efficiency. We knew that the most effective way to tackle all of these challenges would be to make them a hardware priority, a decision that led us to choose a high-density iDataplex cluster. Once the decision was made, we worked with engineers from IBM and Coolcentric to build a custom cooling distribution unit (CDU) that leveraged both the commodity hardware and NERSC’s computer room to maximize energy and cooling efficiency for our facility.
By pursuing an aggressive systems strategy, we were able to decrease the cluster’s power usage effectiveness (PUE) to 1.15; install a terascale cluster (95 TF) in 465 square feet of space, including supporting switches, disk and air handling units; and achieve a smooth transition of operations.
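For readers less familiar with the metric, PUE is total facility power divided by the power delivered to the IT equipment, so a value of 1.15 means roughly 15 percent overhead for cooling and power distribution. The short Python sketch below illustrates that arithmetic. The 400 kW IT load and the comparison PUE of 2.0 are hypothetical figures chosen only for illustration; they are not NERSC measurements.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_kw


def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Power spent on cooling, power distribution and so on for a given PUE."""
    return it_kw * (pue_value - 1.0)


# Hypothetical IT load, for illustration only (not a NERSC figure).
it_load_kw = 400.0
for value in (1.15, 2.0):
    print(f"PUE {value:.2f}: overhead {overhead_kw(it_load_kw, value):.1f} kW")
```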
Aggressive strategy for staying cool and energy-efficient
There were several factors that contributed to our decision to procure the iDataplex hardware. In addition to compute performance, the system’s rack design would contribute to significant space, power and cooling savings for the entire NERSC center. We also knew that we could cost-effectively tweak this commercial hardware to meet our unique needs.
The system’s racks contribute to huge space savings by turning a standard two-foot by four-foot rack 90 degrees to create a doublewide rack with half the depth. This design allowed us to install 80 nodes per rack, though each rack can hold a maximum of 84 nodes. Another appealing feature is the two-rack-unit (RU) enclosure, which incorporates both power supplies and fans into the rack. This removes the need to include these components in individual nodes, which reduces the depth of the servers to 19 inches. It also allows us to save space and energy, as we can use higher-efficiency power supplies and fans, which are shared by two nodes.
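The density figures above can be turned into a quick rack count. The Python sketch below is only back-of-the-envelope arithmetic: it assumes every rack is filled to the stated node count and that each pair of nodes shares one enclosure, and the function name is ours rather than part of any vendor tool.

```python
TOTAL_NODES = 1120        # system size given in the article
NODES_AS_INSTALLED = 80   # nodes per rack as installed
NODES_MAXIMUM = 84        # maximum nodes a rack can hold
NODES_PER_ENCLOSURE = 2   # power supplies and fans are shared by two nodes


def racks_needed(nodes: int, per_rack: int) -> int:
    """Smallest whole number of racks that holds `nodes` (integer ceiling)."""
    return -(-nodes // per_rack)


print("racks at 80 nodes each:", racks_needed(TOTAL_NODES, NODES_AS_INSTALLED))
print("racks at 84 nodes each:", racks_needed(TOTAL_NODES, NODES_MAXIMUM))
print("shared enclosures:", TOTAL_NODES // NODES_PER_ENCLOSURE)
```

Under these assumptions the rack count is the same at 80 and at 84 nodes per rack, so filling racks to the maximum would not have saved floor space.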
In our aggressive systems approach, we also opted for a Rear Door Heat Exchanger (RDHX), a massive water-cooled door that provides both hot air containment and an exit air temperature suitable for cooling the next rack. These doors rely on the rack’s fans to push hot air through their cooling coils. Because of the door’s size, hot air moves slowly past the coils, allowing more time for the heat exchange between air and water. This lets us run the systems’ fans just above idle speed, even when the cluster is fully loaded! The result is energy savings and less noise: the cluster is the quietest system in our machine room.
Figures 1 (upper) and 2 (lower). Courtesy of Brent Draney
Meeting custom cooling needs
Although the new Magellan systems came with built-in energy- and cooling-efficient features, our implementation team wanted to leverage this technology even further to pass the savings on to our entire facility. So, we evaluated the cooling architecture of all our systems and noticed that the return water heading to our chillers averaged about 52 degrees Fahrenheit, which was still cold enough to cool the cluster.
In the game of data center efficiency, the easiest way to achieve power savings is to increase the temperature of return water heading to the chillers. Raising the chilled water temperature reduces the amount of electricity used by the cooling system, which is a net saving for the facility. At NERSC, we realized that we could raise our return water temperatures by using the 52-degree Fahrenheit (F) return water in our main chilled water loop to cool the cluster. Once we had a design, we collaborated with engineers from Coolcentric to build a custom CDU that would allow us to do this.
A standard CDU for the cluster attaches to a center’s main water loop and brings the water through a heat exchanger before circulating it through the water-cooled doors. This method allows the CDU to adjust the pump speed to the doors according to the heat load, and ensures that dew does not build up on the doors. After verifying that the water in our main chilled system met the quality requirements of the doors, we worked with Coolcentric engineers to remove the heat exchangers from the standard CDU. Instead, we had the pumps carry return water directly to the doors and installed a mixing valve to control the temperature of the water flowing through them. This mechanism ensures that the doors stay above dew point.
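To make the dew point constraint concrete, here is a minimal Python sketch of the kind of calculation a mixing valve setpoint could rest on. It is an illustration under stated assumptions, not the logic in our CDU: the dew point uses the standard Magnus approximation, the valve is modeled as ideal mixing of 52 degree F water with warmer 67 degree F door return water, and the room conditions (70 degrees F, 60 percent relative humidity) and the two-degree safety margin are hypothetical.

```python
import math


def f_to_c(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0


def c_to_f(temp_c: float) -> float:
    return temp_c * 9.0 / 5.0 + 32.0


def dew_point_f(air_temp_f: float, rel_humidity_pct: float) -> float:
    """Dew point from the Magnus approximation (coefficients for water vapor)."""
    if not 0.0 < rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in (0, 100]")
    a, b = 17.62, 243.12  # Magnus coefficients; b is in degrees Celsius
    t_c = f_to_c(air_temp_f)
    gamma = math.log(rel_humidity_pct / 100.0) + a * t_c / (b + t_c)
    return c_to_f(b * gamma / (a - gamma))


def warm_fraction(target_f: float, cold_f: float, warm_f: float) -> float:
    """Share of warm water to blend with cold water to reach target_f.

    Assumes ideal mixing of two streams with equal specific heat.
    """
    if warm_f <= cold_f:
        raise ValueError("warm stream must be warmer than cold stream")
    fraction = (target_f - cold_f) / (warm_f - cold_f)
    return min(1.0, max(0.0, fraction))


# Hypothetical room conditions and safety margin, for illustration only.
room_temp_f, room_rh_pct, margin_f = 70.0, 60.0, 2.0
dew_f = dew_point_f(room_temp_f, room_rh_pct)
target_f = max(52.0, dew_f + margin_f)  # never colder than the 52 F supply
print(f"dew point: {dew_f:.1f} F, door supply target: {target_f:.1f} F")
print(f"warm-water fraction: {warm_fraction(target_f, 52.0, 67.0):.2f}")
```

In drier air the computed dew point drops below 52 degrees F, the fraction clamps to zero, and the doors can take the loop water unmixed.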
We also installed a pump control algorithm that uses the doors’ return water temperature to adjust the pump speed. This allows the CDU to dynamically adapt to the changing system load and maintain a constant rack exit temperature two degrees lower than the setpoint. Return water from the doors is then directed into the main chilled water return at 67 degrees F, which allows us to raise the temperature of the entire main chilled water loop. This configuration is remarkably simple and increases the cooling capacity of the CDU by 66 percent over the standard unit for the same pump energy. Because we are essentially using return water to cool multiple systems before sending it to the chiller, without incurring any increased pump demand from the main plant, we see tremendous energy savings.
NERSC Magellan supercomputer racks during installation at Oakland Scientific Facility (OSF). Courtesy of Lawrence Berkeley National Laboratory (LBNL)
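The idea behind a pump control loop of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm running in our CDU: it uses a plain incremental controller, assumes pump flow scales linearly with speed, and relies on the common rule of thumb that water carries about 500 BTU per hour for each gallon per minute and degree F of temperature rise. The gain, speed limits, maximum flow and heat loads are all hypothetical values.

```python
SUPPLY_F = 52.0    # loop water fed to the doors (from the article)
SETPOINT_F = 67.0  # target door return water temperature (from the article)
# Rule of thumb for water: ~8.33 lb/gal * 60 min/hr * 1 BTU/(lb*F) ~= 500.
BTU_PER_HR_PER_GPM_F = 500.0


def door_return_temp_f(heat_btu_per_hr: float, flow_gpm: float) -> float:
    """Water temperature leaving the doors for a given heat load and flow."""
    return SUPPLY_F + heat_btu_per_hr / (BTU_PER_HR_PER_GPM_F * flow_gpm)


def next_pump_speed(speed_pct: float, return_temp_f: float, gain: float = 2.0,
                    min_pct: float = 20.0, max_pct: float = 100.0) -> float:
    """Speed up when return water is too warm, slow down when it is too cold.

    `gain` (percent speed per degree F of error) and the limits are hypothetical.
    """
    error_f = return_temp_f - SETPOINT_F
    return max(min_pct, min(max_pct, speed_pct + gain * error_f))


def settle(heat_btu_per_hr: float, max_flow_gpm: float = 100.0,
           speed_pct: float = 100.0, steps: int = 30) -> float:
    """Run the loop for `steps` updates, assuming flow is linear in pump speed."""
    for _ in range(steps):
        flow_gpm = max_flow_gpm * speed_pct / 100.0
        return_f = door_return_temp_f(heat_btu_per_hr, flow_gpm)
        speed_pct = next_pump_speed(speed_pct, return_f)
    return speed_pct


# Hypothetical heat loads, for illustration only.
for load in (300_000.0, 450_000.0):
    print(f"load {load:,.0f} BTU/hr -> pump speed {settle(load):.1f}%")
```

The point of the sketch is the direction of the feedback: a heavier load warms the return water, the error turns positive, and the pump speeds up until the return temperature is back at the setpoint.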
Customizing for space savings
To limit disruptions to our users, our team wanted to get 1,120 nodes up and running before decommissioning two aging supercomputers. Because we saw this as an opportunity to push the boundaries of hardware density, our solution was to install the racks at a five-foot pitch, leaving only 30 inches of aisle space between cabinets. In our orientation, pitch is the distance from the front of one cabinet row to the front of the next. At the time, this was the densest iDataplex installation ever.
Many data centers use hot and cold aisle separation for energy savings (Figure 1). However, because we opted for the water-cooled doors, we were able to achieve complete containment. Our team then leveraged the hardware to create a single-pass airflow (Figure 2) — after the warm air passes through the RDHX doors and is cooled with return water, it is passed through to cool the row behind it.
Magellan Computing System. Courtesy of LBNL
The doors have a dual-pass coil that carries the chilled water through the outside coil first, cooling the air as much as possible. As the water warms up, it is passed through the inside coil, where it is still cold enough to cool the air contained in the rack. This system keeps the Delta-T (change in temperature) across the door at approximately 40 degrees and feeds into the pump algorithm, which dynamically controls the entire CDU that supplies the RDHX with chilled water. This design is so efficient that, when air finally exits the last row of the cluster, it can be several degrees cooler than when it first entered the system. As a result, we can raise the ambient room temperature to anywhere between 65 and 75 degrees, which also contributes to energy savings.
While such an approach has operational benefits, reducing our energy consumption also underscores NERSC’s role as the primary HPC center for the U.S. Department of Energy. The research conducted by NERSC’s users investigates challenges ranging from more efficient combustion to new solar materials to future energy sources to understanding the effects of energy consumption on our environment. In short, energy research is at the heart of NERSC’s mission.
Brent Draney leads NERSC’s Networking, Servers and Security Group. He also managed the new Magellan cluster implementation. He may be reached at editor@ScientificComputing.com.