“In the upcoming new generation of systems, we will be dealing with hundreds, tens of thousands or even hundreds of thousands of nodes. It’s the fabric that will determine the performance of all these nodes. This is why Intel is making a significant investment in developing a next generation interconnect — the Intel Omni-Path Architecture. And this fabric — because it touches every element in the system — is a key element of Intel’s scalable systems framework.”
So began Joe Yaworski, Intel Director of Fabric Marketing for the HPC Group, as he discussed the recent Intel Omni-Path Architecture (Intel OPA) product announcement.
One of the first instantiations of that new generation of supercomputers will be Aurora, a powerhouse with a peak performance of 180 petaflops that will be installed at Argonne National Laboratory. Intel, which was selected by the U.S. Department of Energy as prime contractor for this first of a new generation of supercomputers, will work with Cray, which will build the actual system.
Aurora, a significant milestone on the road to exascale, is also an example of a major shift in Intel’s role in supporting the development of future HPC systems for users both large and small. The shift centers on the Intel Scalable Systems Framework, a flexible blueprint for developing high-performance, balanced and efficient HPC systems that scale from small clusters to extremely large machines. It supports both compute-intensive and data-centric workloads, and it includes all the essential building blocks, ranging from next-generation Intel Xeon and Intel Xeon Phi processors to Intel solutions for the Lustre parallel file system.
Yaworski points out that the fabric is not only a key component of HPC clusters but also a large portion of a cluster’s cost, and that portion is growing from generation to generation. Today, the fabric represents about 20 to 30 percent of the cluster’s cost. As we move toward EDR InfiniBand*, that number will rise to an unacceptable 30 to 40 percent. Compounding the problem is the fact that InfiniBand was not designed for HPC. InfiniBand Verbs, its communications interface, provides a sub-optimal foundation for supporting upper-layer protocols such as MPI, and its Quality of Service (QoS) features are ineffective in mixed-traffic application environments.
Low-latency, high-priority MPI traffic can be held up when large-block, low-priority storage traffic enters the network. InfiniBand’s handling of transient errors also falls short of system performance demands: at speeds of 100Gb/s, the fabric itself needs a mechanism for fast, reliable error detection and correction, a capability InfiniBand was never defined to have.
A Full Fabric Solution
Intel Omni-Path Architecture will not only provide much-needed advanced capabilities but also return cost ratios to the 20 to 30 percent range. Intel is providing a full fabric solution that includes not only excellent price/performance but also all the adapters, switches, cabling and software that its partners need to create custom-designed solutions. Included are a host interconnect running at 100Gb/s, a 48-port switching ASIC, and director-class switches with 192 and 768 ports.
“We’re taking an evolutionary approach with revolutionary additions,” says Yaworski. “Intel Omni-Path Architecture incorporates the best features of our existing Aries and Intel True Scale Fabric and introduces significant new features and capabilities to improve overall HPC performance, reliability and QoS. Our strategy is to build upon the OpenFabrics Alliance (OFA) and the OpenFabrics Enterprise Distribution (OFED) to ensure compatibility and stability with existing applications and systems. This significantly reduces the risk of bringing new technology to the marketplace.”
He points out that Intel OPA is a performance-enhanced architecture featuring CPU/fabric integration, an optimized host implementation and an enhanced fabric architecture.
- CPU/fabric integration: By integrating the fabric with Intel Xeon Phi processors and the next generation of Intel Xeon processors, Intel OPA achieves lower power, lower cost and greater density. Bandwidth to each socket is also increased: for example, the latest Intel Xeon Phi processor, code-named Knights Landing, integrates two fabric ports for a total bi-directional bandwidth of 50GB/s (two 100Gb/s ports, each carrying 12.5GB/s in each direction). Overall, moving the integration closer to the CPU results in increased bandwidth and improved latency to each socket.1
- Optimized host implementation: Intel OPA software provides an extremely high MPI message rate, a feature that supports peak performance and high application scalability, including maintaining low end-to-end latency at scale. The architecture is connectionless: address mapping is stored in system memory rather than connection state being held in the adapter’s cache, so there is no penalty for a cache miss.
- Enhanced fabric architecture: Intel OPA is tuned for high compute throughput. Each of the 48 switching ports on the switching ASIC can handle 195M messages/sec.1 This reduces the potential for congestion in the fabric and supports very high MPI message rates.
Port-to-port latency is very low, about 100 to 110 nanoseconds between any two ports within the switch,1 and this includes all error detection and correction latency, which is handled entirely within the fabric itself. This avoids the latency of end-to-end recovery, in which the receiving node must report errors all the way back to the sending node for correction and retransmission. InfiniBand, on the other hand, uses a technology called Forward Error Correction (FEC), which can double the switch latency for every packet moving across the fabric.
Intel Omni-Path Architecture also provides Quality of Service (QoS) features that optimize the movement of data in a mixed environment consisting of high-priority small MPI messages and large blocks of data, often from storage. Included are features such as dispersive routing, which load-balances traffic across multiple paths (a simplified sketch of the idea appears below), along with other advanced techniques designed to minimize jitter.
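To make the dispersive routing idea concrete, the toy sketch below spreads successive packets of a single flow across several equal-cost paths using a simple hash of the packet sequence number. The path count and hashing policy are illustrative assumptions; real Omni-Path routing is configured by the fabric manager, and this is not Intel’s implementation.

```c
/* Toy illustration of dispersive routing: instead of pinning a whole flow to
 * one path, successive packets are spread across several equal-cost paths.
 * Conceptual sketch only; the path count and hash are assumptions, not the
 * actual Intel OPA routing logic. */
#include <stdio.h>
#include <stdint.h>

#define NUM_PATHS 4   /* assumed number of equal-cost paths between two nodes */

/* Pick a path for one packet from its flow id and sequence number. */
static unsigned pick_path(uint32_t flow_id, uint32_t seq)
{
    /* Simple integer mix so consecutive packets of one flow land on different paths. */
    uint32_t h = flow_id * 2654435761u + seq;
    return h % NUM_PATHS;
}

int main(void)
{
    unsigned packets_per_path[NUM_PATHS] = {0};

    /* Send 16 packets of flow 42 and count how they spread across the paths. */
    for (uint32_t seq = 0; seq < 16; seq++) {
        unsigned path = pick_path(42u, seq);
        packets_per_path[path]++;
        printf("packet %2u -> path %u\n", seq, path);
    }

    for (unsigned p = 0; p < NUM_PATHS; p++)
        printf("path %u carried %u packets\n", p, packets_per_path[p]);

    return 0;
}
```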
Architected for MPI Applications
The Omni-Path Architecture is specifically designed for MPI applications, a mainstay of HPC, and supports a wide array of MPI and other user-level communication functions.
The Omni-Path Architecture is also designed to scale with today’s dense multi-core, multi-socket servers, which combine faster processors, higher memory bandwidth, and ever more cores.
Intel OPA builds upon the best of InfiniBand from the Intel True Scale architecture. It uses the fourth generation of PSM (Performance Scaled Messaging), an optimized interface library that sits between upper-layer protocols such as MPI, which is used by 95 to 98 percent of all HPC applications, and the underlying fabric. PSM fully understands the semantics that MPI uses for communication, so it provides a lightweight path that consumes very few CPU cycles to handle an MPI message. Because it is lightweight, with a very tight code path, PSM reduces latency and supports very high message rates. (A minimal example of the tagged MPI traffic that such a layer handles is shown below.)
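For reference, the minimal program below shows the kind of tagged MPI point-to-point traffic whose matching semantics a layer such as PSM implements beneath the MPI library. It is ordinary, implementation-neutral MPI code, not Intel-specific, and would run unchanged over an Omni-Path fabric through an MPI library layered on PSM.

```c
/* Minimal two-rank MPI example of the tagged send/receive semantics that a
 * matching layer such as PSM services beneath the MPI library.
 * Implementation-neutral: compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 42;
        /* Tag 7 must be matched by the receiver; this tag matching is the
         * MPI semantic the underlying messaging layer understands directly. */
        MPI_Send(&msg, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```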
Typical test results include MPI message rates of 115 million messages per second (Mmps) uni-directional and 123 Mmps bi-directional, along with MPI bandwidth of 23.4GB/s over one port uni-directional and 24.5GB/s over one port bi-directional.1 Succeeding processor generations should push these results toward the estimated architecture maximums of 160 million messages/sec. uni-directional and 300 million messages/sec. bi-directional.
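Message-rate numbers of this kind are typically produced by a streaming microbenchmark in which one rank issues windows of small non-blocking sends and its peer posts matching receives. The sketch below is a generic version of such a test, in the spirit of the OSU message-rate benchmarks; the message size, window depth and iteration count are assumptions, and this is not the harness behind the published Intel figures.

```c
/* Generic uni-directional MPI message-rate microbenchmark sketch.
 * Rank 0 streams windows of small non-blocking sends to rank 1, which posts
 * matching receives; the reported rate is messages divided by elapsed time.
 * Message size, window depth and iteration count are illustrative assumptions.
 * Run with: mpirun -np 2 ./msgrate */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   8        /* bytes per message (assumed)                */
#define WINDOW     64       /* outstanding messages per window (assumed)  */
#define ITERATIONS 10000    /* number of windows (assumed)                */

int main(int argc, char **argv)
{
    static char buf[WINDOW][MSG_SIZE];
    MPI_Request reqs[WINDOW];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf[w], MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[w]);
            else
                MPI_Irecv(buf[w], MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[w]);
        }
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

        /* Per-window handshake keeps sender and receiver in step. */
        if (rank == 0)
            MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else
            MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("%.1f million messages/sec\n",
               (double)ITERATIONS * WINDOW / elapsed / 1e6);

    MPI_Finalize();
    return 0;
}
```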
Among Intel OPA’s advanced features is fine-grained control at the fabric link level. This feature supports traffic flow optimization, packet integrity protection and dynamic lane scaling.
- Traffic flow optimization: Intel OPA optimizes QoS in mixed-traffic environments, which are typically made up of large, low-priority storage transfers contending for fabric resources with small, high-priority MPI messages. Fine-grained control allows the transmission of lower-priority packets to be paused so that higher-priority packets can be transmitted (see the sketch following this list). The result is a faster time to solution and less run-to-run timing inconsistency.
- Packet integrity protection: This feature supports rapid recovery from transmission errors on an Intel OPA link, with low latency for both corrupted and uncorrupted packets. Fixes occur at the link level rather than end-to-end, providing much lower latency than InfiniBand’s forward error correction.
- Dynamic Lane Scaling: Maintains link continuity in the event of a failure in one or more physical lanes by operating with the remaining lanes until the problem can be corrected.
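To put rough numbers on why pausing a bulk packet matters: at 100Gb/s a byte takes 0.08 nanoseconds to serialize, so a small MPI message that arrives just after a large storage packet has begun transmitting would otherwise wait for the remainder of that packet. The sketch below works through that arithmetic for assumed packet sizes; it is a back-of-the-envelope model of the preemption idea, not a description of the actual Intel OPA link protocol.

```c
/* Back-of-the-envelope model of why pausing a low-priority bulk packet helps
 * a high-priority message on a 100Gb/s link. Packet sizes are assumptions for
 * illustration; this is not a model of the actual Intel OPA link protocol. */
#include <stdio.h>

#define LINK_GBPS      100.0    /* link speed in gigabits per second          */
#define STORAGE_BYTES  8192.0   /* assumed large, low-priority storage packet */
#define MPI_BYTES      64.0     /* assumed small, high-priority MPI message   */

/* Time to serialize a packet onto the link, in nanoseconds. */
static double serialize_ns(double bytes)
{
    return bytes * 8.0 / LINK_GBPS;   /* bits divided by Gb/s yields ns */
}

int main(void)
{
    /* Without preemption: the MPI message waits for the entire storage packet
     * already on the wire, then pays its own serialization time. */
    double blocked = serialize_ns(STORAGE_BYTES) + serialize_ns(MPI_BYTES);

    /* With preemption: the bulk packet is paused almost immediately, so the
     * MPI message pays roughly only its own serialization time. */
    double preempted = serialize_ns(MPI_BYTES);

    printf("storage packet serialization:      %6.1f ns\n", serialize_ns(STORAGE_BYTES));
    printf("MPI latency behind storage packet: %6.1f ns\n", blocked);
    printf("MPI latency with preemption:       %6.1f ns\n", preempted);
    return 0;
}
```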
Conclusion
Intel Omni-Path Architecture is designed for HPC. It is built for performance and scalability, and its emphasis on QoS and reliability ensures that the architecture is one of the best fabrics on the market.
Intel OPA is optimized at each layer and is tightly integrated with the CPU; over time, the fabric and the CPU will become indistinguishable. This means lower power, lower cost, reduced latency, and higher density, all resulting from driving the fabric directly out of the CPU package. In addition, bandwidth to each socket will continue to increase as Omni-Path evolves, removing a bottleneck that was part of previous-generation fabrics.
Intel OPA’s key role as one of the essential building blocks of the Intel Scalable Systems Framework ensures that, for years to come, the fabric will be an integral part of HPC systems ranging from advanced desktop workstations to the world’s most powerful supercomputers.
Reference
1. Transforming the Economics of HPC Fabrics with Intel Omni-Path Architecture (8-page PDF)
John Kirkley is a Portland-based technology writer.