Simulating the World
Once the domain of a few researchers with large budgets, high performance computing has evolved into a mainstream tool built on commodity, standards-based systems.
Computer-driven simulation has become a widespread research tool, enabling massive advances across a multitude of industries. It is now an accepted, cost-effective, low-risk way to investigate solutions without having to build the target system itself. Aircraft design, military applications and large systems such as submarines and power stations have all benefited from computer simulation. It was an obvious choice for several reasons:
• Training and testing in the actual craft are far more expensive than conducting them on a simulator.
• Componentized testing for systems is much more efficiently done on system simulators than on the actual craft.
• Scenarios such as wind shear crashes, failures of multiple hydraulic systems, icy runway landings and nuclear reactor instability are far too dangerous to perform in real life.
While the U.S. government has mandated the use of simulation for the aircraft industry, other areas of engineering have embraced simulation for hard economic reasons:
• Aircraft manufacturers found that high fidelity simulations of airfoil design would allow the first manufacturing run to deliver a finely tuned and efficient wing.
• Computer simulation also significantly reduces the hours spent in the wind tunnel, a stage that previously drove several rounds of modifications on the production line.
• Automobile manufacturers have dispensed with clay modeling and wind tunnel testing in favor of computer models that create the look and aerodynamics they want.
• Nuclear bomb tests have moved from above-ground testing in the desert to inside a computer for very obvious environmental, health and cost reasons.
• Battlefield simulations are increasingly being used to allow troops to be trained in combat situations and to make decisions under pressure before ever seeing the field.
All of these requirements for simulation have caused the high performance computing (HPC) market to change. What was once the domain of a few researchers with large budgets, working with proprietary systems optimized for one task, has evolved into a field of many researchers with constrained budgets using commodity, standards-based systems. This market dynamic, combined with massive advances in the processing capabilities of commodity processor vendors such as Intel and AMD, has resulted in many industries relying on high performance computers for their simulation needs.
Computers that leverage commodity processors come in two forms: clusters and large SMP/NUMA-based systems. Clusters, which comprise a large number of stand-alone nodes connected via a network, are handed discrete or interactive jobs orchestrated by a software broker. SMPs, by contrast, behave as one coherent region under the management of a single operating system, which allocates processes to processors.
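To make the distinction concrete, the following minimal Python sketch contrasts the two models. It is purely illustrative: simulate and submit_to_broker are hypothetical stand-ins for a real simulation kernel and a real job broker.

from multiprocessing import Pool

def simulate(case_id):
    return case_id * case_id               # hypothetical stand-in for a simulation kernel

def submit_to_broker(node, case_id):       # hypothetical stand-in for a real job broker
    print(f"dispatch case {case_id} to {node}")

if __name__ == "__main__":
    # SMP/NUMA style: a single operating system schedules processes onto its own CPUs.
    with Pool(processes=8) as pool:
        results = pool.map(simulate, range(64))

    # Cluster style: a software broker hands discrete jobs to stand-alone nodes.
    for case in range(64):
        submit_to_broker(f"node{case % 16:02d}", case)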
The problem with SMPs
The secret sauce of a large SMP with many CPUs (16, 32, 64, and so forth) is its hardware-based cache coherency model. Cache coherency allows multiple CPUs to share a single memory space under the management of a single operating system. However, the number of processing cycles required to service cache coherency increases exponentially with the number of CPUs inside the SMP. Performance is also highly dependent on the application being run and on how much memory management and parallelism efficiency is built into it. In short, cache coherency models often fail to deliver the expected performance, and SMP efficiency becomes entirely dependent on the code being run. SMPs with large cache-coherent regions demand so much CPU and memory attention that adding more processors delivers diminishing returns as the system grows.
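These diminishing returns can be illustrated with a toy speedup model in Python. The overhead coefficient and growth exponent below are assumptions chosen for illustration, not measurements of any particular SMP.

def effective_speedup(n_cpus, overhead_coeff=0.004, growth=2.0):
    """Ideal speedup of n_cpus minus cycles assumed lost to cache coherency."""
    coherency_cost = overhead_coeff * (n_cpus ** growth)
    return max(n_cpus - coherency_cost, 0.0)

for n in (4, 8, 16, 32, 64, 128):
    print(f"{n:4d} CPUs -> effective speedup ~ {effective_speedup(n):6.1f}")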
The value of SMPs is the shared-memory programming environment. In many cases, however, this is offset by poor execution performance. This is why owners of large SMP systems typically use 'slices' of the SMP to minimize the extent of cache coherency domains and disable cache coherency across the entire system. Simply put, SMP architectures do not scale, and simulation builders have recognized that clusters are the best mechanism for extracting the processing value of commodity processors.
Clusters also face challenges
While clusters hold the promise of scalable distributed computing, their biggest challenge is also the characteristic that defines them. To get the maximum output from a large set of independent compute resources, the communications between them must be very low latency and must never become a performance bottleneck. Unfortunately, current adjunct interconnect mechanisms are limited in bandwidth and force programmers to cope with limited connectivity inside a cluster. IBM and the U.S. Department of Energy recognized these limitations and embarked on the Blue Gene project to overcome the performance and scalability issues of adjunct interconnects.
The average cluster node is simply a general purpose server that must satisfy many different design goals. Today's server incumbents must ensure that their designs perform well as file/print, Web and database servers, all at a reasonable price. The traditional server design is a compromise across all of these potential roles, resulting in mediocre performance that does not serve a high-performance application well at all.
The traditional server PCI bus is a single bus that must service the needs of the interconnect host channel adapter, the video card and any other card inside the server. Even four-way high-performance servers are throttled by the general purpose bus through which they are forced to view the network. These performance bottlenecks are a constant frustration to MPI programmers, because layering applications on top of such a compute infrastructure significantly throttles them. Developers typically spend as much time tuning their codes to alleviate communications congestion around hot spots in the cluster network as they do solving scientific or engineering problems. Besides being a distraction from core HPC research, this effort further reduces code portability to other standards-based computing environments.
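The kind of hand-tuning described above often amounts to overlapping computation with communication so that traffic around hot spots can drain. The following mpi4py sketch shows the pattern in miniature; the ring topology, buffer sizes and placeholder computation are assumptions for illustration only.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send_buf = np.full(1024, rank, dtype=np.float64)
recv_buf = np.empty(1024, dtype=np.float64)

# Post communication first, then do local work while messages are in flight.
reqs = [comm.Isend(send_buf, dest=right, tag=0),
        comm.Irecv(recv_buf, source=left, tag=0)]
local_result = np.sum(send_buf * 2.0)      # stand-in for the real computation
MPI.Request.Waitall(reqs)                  # only now block on the interconnect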
Solution
High performance computing applications need a native interconnect-driven server architecture that merges computing and communications resources into a powerful, yet standards-based system platform. This converged system should be optimized to bring the full power of standards-based interfaces like HyperTransport and leading edge processors directly into an integrated communications network. Once compute and communications resources are manageable as integral elements of the system, today's HPC bottlenecks are removed and a brand new set of computing characteristics emerges.
Copious amounts of affordable inter-processor communication remove the need to code around restrictive bandwidth limits. Bandwidths of over 10 GB/s are enough to remove the programmatic restrictions currently experienced by many HPC users. The fabric must be non-blocking, with a constant bi-sectional bandwidth (CBB) of at least 1: any processor is guaranteed full bandwidth to any other processor, regardless of traffic between other processors at the time, and, more importantly, communications performance is independent of processor locality in the network. This is rarely achieved in traditional clustered environments, which causes unpredictable performance results, especially when collective calls are made over a production cluster.
Many interconnects deliver a CBB of between 0.1 and 0.3 because of the exorbitant cost of the layered switch configurations required to achieve CBB=1 at bandwidths greater than 2 GB/s. That guarantees only that two processing nodes located at opposite ends of the network will receive 10 to 30 percent of the maximum node-to-node bandwidth once traffic overhead is taken into account. The result is congestion and the associated congestion-based latency increases. This kind of architecture again forces the computational scientist to develop algorithms with processor locality in mind. To eliminate communications bottlenecks, the key is a high-performance, fault-tolerant interconnect with CBB=1 and a routing pattern based on an intelligent load balancing algorithm rather than source routing. An interconnect-driven server architecture makes this intelligent system balancing transparent to the user.
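The 10 to 30 percent figure follows directly from the CBB definition, as this short worked example shows; the 2 GB/s per-link figure is illustrative.

link_bw_gbs = 2.0                      # assumed peak node-to-node bandwidth, GB/s

for cbb in (0.1, 0.3, 1.0):
    worst_case = cbb * link_bw_gbs     # bandwidth guaranteed across the network bisection
    print(f"CBB={cbb:>3}: worst-case node-to-node bandwidth "
          f"{worst_case:.1f} GB/s ({cbb:.0%} of peak)")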
Traditional server nodes have had no way to set input/output (I/O) quality of service (QoS) for specific applications. Port-based configurations are possible on SAN switches but are statically defined and very inflexible. Converged platforms allow QoS to be defined at the I/O layer, because I/O capacity can be pooled and then allocated to specific applications based on priority. For example, 70 percent of the available I/O capacity can be assigned to one application, with three other applications assigned 10 percent each. As more I/O capacity is added, it can be assigned in the same ratios through software control. This I/O QoS can be defined from the I/O gateway to the individual compute nodes or to a virtual machine, and the QoS parameters can be changed dynamically in software. Policy-driven resource management enables system administrators to properly govern shared resources under heavy load or when component faults occur. This is a central design characteristic of an interconnect-driven server architecture. The result is a balanced system in which I/O utilization can be engineered rather than left to the random behavior of the applications.
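The 70/10/10/10 example above is simple proportional arithmetic, sketched below in Python. The allocate_io function is hypothetical and stands in for whatever policy engine a converged platform provides; the capacity figures are assumptions.

def allocate_io(total_gbs, shares):
    """Split pooled I/O capacity among applications by fixed percentage shares."""
    return {app: total_gbs * pct / 100.0 for app, pct in shares.items()}

shares = {"crash-sim": 70, "post-proc": 10, "viz": 10, "backup": 10}

print(allocate_io(20.0, shares))   # initial pooled capacity, in GB/s
print(allocate_io(40.0, shares))   # added capacity is assigned in the same ratios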
Another challenge facing clusters today is the significant loss of memory to the protocol overhead needed to support large cluster node counts, and the loss of CPU power to the processor cycles needed to manage the interconnect. The inefficient memory usage stems from dedicated cluster node buffers, which are traditionally allocated to each node-to-node connection. Converged systems can use shared-buffer communication architectures to create one pool of buffer resources for interconnected nodes, yielding much more efficient memory usage than traditional systems. In addition, the communications component of the converged system is optimized to minimize the CPU load required to service the interconnect. The result is that more memory and more CPU cycles are available to service the HPC application.
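A back-of-envelope comparison makes the memory argument clear. The buffer size, pool size and node counts below are assumptions chosen only to show the scaling behavior.

BUF_MB = 1.0          # assumed dedicated buffer per node-to-node connection
POOL_MB = 256.0       # assumed shared buffer pool per node in a converged design

for nodes in (128, 512, 2048):
    dedicated = (nodes - 1) * BUF_MB          # one buffer per peer on every node
    print(f"{nodes:5d} nodes: dedicated buffers {dedicated:7.0f} MB/node "
          f"vs shared pool {POOL_MB:.0f} MB/node")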
Clusters are a loosely coupled, distributed set of systems renowned for their lack of management capabilities. Most of the nodes run stand-alone operating systems with very little management software in place, due to the high cost of third-party management tools, the lack of specialized cluster management tools and the difficulty of managing heterogeneous environments. An interconnect-driven server architecture provides a 'big brother' function that watches for problems in both the compute and interconnect components and provides continuous real-time diagnostics. This preemptive management and control watchdog can only be present in a converged system, where the compute and communication resources operate in optimized harmony with known operational measurements.
Future predictions
Traditional SMPs and general-purpose compute nodes are yielding to the groundbreaking capabilities of the interconnect-driven server architecture. This architecture will deliver on the inherent benefits of computing and communications convergence, since it can truly exploit the expanding capabilities of today's microprocessors. The converged system will extend the speed and low latency of HyperTransport to other compute nodes through a dedicated, embedded interconnect. It also will ease the scalability issues associated with traditional hardware-based cache coherent memory management. Developers can also expect MPI, UPC and other parallel languages to be highly optimized for their interconnected application requirements. The embedded management of an interconnect-driven server system will yield a homogeneity and a cost of ownership that previous computing models have not delivered.
Mike Kemp is chief technology officer and co-founder of Liquid Computing. He can be reached at [email protected].