Optimization of workflows in a modern HPC environment is now a globally distributed, heterogeneous-hardware-challenged task for users and systems administrators. Not only is this a mouthful to say, it is also a complex task that requires significant software support. In the old days, job schedulers were only tasked with running jobs efficiently on relatively similar hardware platforms inside a single data center. Optimizing hardware utilization was not a simple task even then, but current HPC users and systems administrators must now manage workflows spanning a mix-and-match collection of heterogeneous hardware, such as CPUs, GPUs and Intel Xeon Phi devices, installed in multiple, distributed clusters that may sit within a single organization, be physically spread across data centers around the world, and may even include resources nebulously contained within the cloud.
IBM, for example, offers its Platform LSF suite of tools built on top of the well-known LSF job scheduler, which has been a core component in HPC centers for many years. Platform LSF provides the IBM Platform Session Scheduler and IBM Platform Data Manager tools to create ‘virtual private clusters’ that can asynchronously run jobs on a local cluster, a geographically distant cluster, or inside the cloud. Jobs running within these virtual private clusters need only communicate with the scheduler inside the virtual private cluster. This means users can submit large volumes of tasks within the virtual private cluster that run asynchronously on the remote hardware without waiting for the main scheduler’s approval. In this way, IBM’s Platform LSF sidesteps communications limitations and the impact of speed-of-light latency, even across long distances, to deliver extreme scaling within the job scheduler. Similarly, IBM Platform Data Manager stages data across distributed clusters via localized smart caches to eliminate data access delays as much as possible.
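A back-of-envelope calculation shows why keeping the dispatch loop local matters. The numbers below are illustrative assumptions (a 10,000 km fiber path, light traveling at roughly two-thirds of c in fiber, 100,000 short tasks), not measurements of any particular system:

```python
# Illustrative only: cost of dispatching each task across a WAN versus
# through a scheduler local to the remote cluster ("virtual private cluster").
FIBER_KM_PER_S = 200_000        # light in fiber travels at ~2/3 the speed of light
distance_km = 10_000            # assumed intercontinental fiber path

round_trip_s = 2 * distance_km / FIBER_KM_PER_S   # one scheduler round trip

tasks = 100_000                                   # assumed number of short tasks
wan_dispatch_overhead_s = tasks * round_trip_s    # every task waits on a WAN round trip
local_dispatch_overhead_s = round_trip_s          # one round trip to hand off the batch

print(f"Per-task WAN round trip:      {round_trip_s * 1000:.0f} ms")
print(f"Serialized WAN dispatch cost: {wan_dispatch_overhead_s / 3600:.1f} hours")
print(f"Local dispatch cost:          {local_dispatch_overhead_s:.1f} s")
```

Even at 100 ms per round trip, serializing 100,000 dispatches over the WAN costs hours of pure latency, while a local scheduler pays that round trip roughly once.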
Such software tools help in creating and running tasks that can scale in these asynchronous, geospatially distributed environments, even with the added caveat that the environment can change dynamically through the addition and removal of cloud resources and clusters. These same software tools help users and systems administrators optimize their workflows and job scheduling to efficiently utilize systems that contain massively parallel accelerators and coprocessors, as well as address more ‘mundane’ hardware differences, such as variations in memory capacity and CPU type.
In the IBM Platform ecosystem, workflow creation is supported via a graphical user interface (GUI) that lets users draw the data flow and computational interactions. People interact much more naturally with a GUI, as it lets them graphically visualize the overall computational work and data flows. A well-designed GUI (and set of GUI templates) can abstract the workflow sufficiently that script generators, much like a compiler for a parallel computer, can then create the scripts that contain the complex task and command invocations that implement the user workflow. Further, these scripts can be targeted to run on a specific hardware configuration (again, much like a compiler generating code for multiple CPU architectures), be it a local cluster or an aggregation of multiple clusters and cloud environments containing a number of asynchronous ‘virtual private clusters.’
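The script-generator idea can be sketched in a few lines. The workflow description format below is hypothetical, invented for this illustration; the emitted `#BSUB` directives mirror LSF's submission-script syntax (`-J` names the job, `-n` requests slots, `-w "done(...)"` expresses a dependency):

```python
# Hypothetical workflow description: each step names its command, its
# resource request, and the steps it must wait for.
workflow = [
    {"name": "preprocess", "cmd": "./prep input.dat", "slots": 1,  "after": []},
    {"name": "simulate",   "cmd": "./sim prep.out",   "slots": 64, "after": ["preprocess"]},
    {"name": "analyze",    "cmd": "./post sim.out",   "slots": 4,  "after": ["simulate"]},
]

def generate_script(steps):
    """Emit one LSF-style submission stanza per workflow step."""
    lines = []
    for step in steps:
        lines.append(f'#BSUB -J {step["name"]}')          # job name
        lines.append(f'#BSUB -n {step["slots"]}')         # slot count
        for dep in step["after"]:
            lines.append(f'#BSUB -w "done({dep})"')       # run only after dep completes
        lines.append(step["cmd"])
        lines.append("")
    return "\n".join(lines)

print(generate_script(workflow))
```

Retargeting the same workflow description at a different scheduler or cluster then becomes a matter of swapping the emitter, which is exactly the compiler analogy above.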
Optimizing resource utilization means the systems team needs to see what is happening inside their globally distributed, asynchronously running multi-cluster HPC environment in real time, a non-trivial data collection and visualization task by itself. Further, both users and the systems management team need to be able to analyze the performance of the HPC center so users can improve the efficiency of their workflows over the short term, and both users and the systems team can collaborate on HPC upgrades and new procurements to improve efficiency over the long term.
Both real-time data acquisition and the analysis of aggregate HPC datacenter information are big-data tasks that might be larger than some of the scientific questions being investigated! Think of the amount of monitoring and profile data that can be generated by many thousands of nodes in real time, or the amount of data that must be gathered and stored for later analysis from those same nodes over the lifetime of the hardware. However, targeted data-driven decision-making is an essential part of data center operations and the procurement process, be it for a new system, a system upgrade, or to quantify cost and runtime machine requirements when contracting with a cloud-based service.
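A quick estimate makes the scale concrete. Every parameter below is an assumption chosen for illustration (10,000 nodes, 100 metrics per node sampled once a second, 8 bytes per sample), not a measurement from any real monitoring system:

```python
# Illustrative monitoring-data volume estimate; all inputs are assumptions.
nodes = 10_000
metrics_per_node = 100      # e.g., per-core utilization, temperatures, power, ...
bytes_per_sample = 8        # one double-precision value per metric
samples_per_day = 86_400    # one sample per second

daily_bytes = nodes * metrics_per_node * bytes_per_sample * samples_per_day
yearly_tb = daily_bytes * 365 / 1e12

print(f"Per day:  {daily_bytes / 1e9:.0f} GB")
print(f"Per year: {yearly_tb:.0f} TB")
```

Roughly 700 GB per day, or a quarter of a petabyte per year, and that is before any application profiling data is added to the stream.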
Balance ratios, as discussed in my 2007 Scientific Computing article, “HPC Balance and Common Sense,” are a commonly used set of metrics that can extrapolate the characteristics of a newer, faster machine that can run a job mix efficiently based on the hardware characteristics of an existing system. The TOP500 site uses balance metrics based on synthetic benchmarks to compare systems. By extension, balance ratios and other metrics based on historical workload data for a site can be — and are — an invaluable tool for workload optimization and procurement planning. In short, balance ratios can distill a tremendous amount of ‘big data’ HPC performance information into a few numbers. They are but a few of the many analytic tools (many of which are not so concise) that can be used to analyze and optimize HPC data center procurements and operations.
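The computation itself is simple division; the value lies in which ratios a site chooses to track. The sketch below uses three common balance ratios, and the hardware numbers are illustrative placeholders rather than vendor specifications:

```python
# Sketch of the balance-ratio idea: reduce a machine's characteristics
# to a few per-flop ratios. All hardware figures below are made up for
# illustration, not real product specifications.

def balance_ratios(peak_gflops, mem_bw_gbs, mem_gb, net_bw_gbs):
    return {
        "bytes_per_flop":   mem_bw_gbs / peak_gflops,  # memory-bandwidth balance
        "memory_per_gflop": mem_gb / peak_gflops,      # memory-capacity balance
        "network_per_flop": net_bw_gbs / peak_gflops,  # interconnect balance
    }

existing_node = balance_ratios(peak_gflops=100.0,  mem_bw_gbs=25.0,
                               mem_gb=16.0, net_bw_gbs=1.0)
# A candidate node that is 10x faster in flops but only 4x in memory bandwidth:
candidate_node = balance_ratios(peak_gflops=1000.0, mem_bw_gbs=100.0,
                                mem_gb=64.0, net_bw_gbs=5.0)

for key in existing_node:
    print(f"{key}: {existing_node[key]:.3f} -> {candidate_node[key]:.3f}")
```

Here the candidate's bytes-per-flop ratio drops from 0.25 to 0.10, a warning that memory-bound jobs in the existing workload will not see anything close to the 10x flops improvement.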
Packages such as IBM’s Platform LSF are attractive in that they provide an integrated experience from the user through to the systems management team. Other robust and respected job scheduling packages, such as SLURM, are also available. The SLURM ecosystem provides a number of similar tools, including the ability to run applications in distributed environments such as the TeraGrid.
Alternative profiling and analysis packages also exist. One example is the free NWperf tool set discussed in my February 2015 Scientific Computing article, “Using Profile Information for Optimization, Energy Savings and Procurements.”1 The commercial Allinea MAP profiler also provides information programmers need to optimize their HPC workflows.
Regardless, people need the ability to find quantifiable, data-driven answers to their questions about application, workload and data center efficiency. The increasing size and dynamic nature of global HPC operations, along with the inclusion of heterogeneous hardware, simply means people need additional help in monitoring and optimizing workflows and data center operations.
REFERENCE
1. “Using Profile Information for Optimization, Energy Savings and Procurements,” Scientific Computing, February 2015. www.scientificcomputing.com/articles/2015/02/using-profile-information-optimization-energy-savings-and-procurements
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.