Increasing Cluster Efficiency — Just Push ‘Pause’

Bill Bryce is Vice President of Products at Univa. I spend my days (and sometimes nights) dedicated to workload optimization of data centers. My goal has always been to improve the utilization of clusters so that organizations using shared compute resources see real-world results.

In a typical customer environment, applications are rarely stopped or preempted in order for other work to run because, historically, this results in losses of time and money. Previously, high-priority workload was often left waiting for lower-priority applications to cede use of the resources.

We hear about this dilemma loud and clear from a multitude of companies in life sciences, electronic design, financial services, manufacturing, scientific research and oil and gas exploration. What they want is a function similar to a home DVR system, where users can be confident that if they pause one program to watch another, they will always be able to come back to finish the first.

For example, we have many customers who run Univa Grid Engine in the electronic design automation market; basically, these are the companies that design computer chips. They operate very large clusters with hundreds, and sometimes even thousands, of users, and they run 10-100K jobs per day. These customers must purchase software licenses for design tools, and those licenses can be very expensive, sometimes $100K or more for a single license. To run an efficient operation, they keep the software licenses in continual use and the cluster as busy as possible.

But what happens when a more important project comes along? Well, in the past, they had to either wait until some machines and licenses became available (which could take up to a day for some of these licenses), or they had to stop another lower-priority project’s work — losing all of the time that was spent on that lower-priority project.

As you can imagine, users have not been too happy about this. You are always upsetting someone to get the high-priority work done, and the team members in the lower-priority project just have to run their work again — messing up their engineering timelines and schedules.

There is a new solution, however, that is becoming a reality. Now, for the first time, many data centers can preempt or “pause” workload, services or containers in their system at a moment’s notice in order to run higher-priority workload. This new capability allows users or administrators to decide which work is highest priority and, with a single command, start the high-priority work and pause lower-priority work. The preempted work automatically resumes once the higher-priority application is completed.

The first such solution is the Preemption feature for Univa Grid Engine users. With the Preemption feature, users can pause the lower-priority design application and preempt the software license for that application, giving it to the higher-priority project. This gives the best of both approaches: the high-priority work starts right away, and the time already spent on the lower-priority work is not lost. Once the high-priority work has completed, Univa Grid Engine makes sure that the lower-priority project gets the machines and the software license back, so the application can run to completion.
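To make the pause-and-resume idea concrete, here is a minimal sketch in Python. It is a toy model only, not Univa Grid Engine's implementation or interface: the `Job` and `LicensePool` names, the numeric priorities and the job names are all hypothetical, and a real scheduler also has to handle machines, memory and checkpointing. The sketch only shows the bookkeeping: a higher-priority job takes the license seat, the lower-priority job is paused rather than killed, and it gets the seat back when the urgent work finishes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int            # hypothetical scale: higher number = more important
    state: str = "pending"   # one of: pending, running, paused, done


class LicensePool:
    """Toy model of a cluster with a fixed number of license seats."""

    def __init__(self, seats: int) -> None:
        self.seats = seats
        self.jobs: list[Job] = []

    def running(self) -> list[Job]:
        return [j for j in self.jobs if j.state == "running"]

    def submit(self, job: Job) -> None:
        self.jobs.append(job)
        if len(self.running()) < self.seats:
            job.state = "running"          # a seat is free, start right away
            return
        # No free seat: pause the least important running job, but only
        # if the new job actually outranks it. Otherwise the new job waits.
        victim = min(self.running(), key=lambda j: j.priority)
        if victim.priority < job.priority:
            victim.state = "paused"        # seat released, work is kept
            job.state = "running"

    def finish(self, job: Job) -> None:
        if job.state != "running":
            raise ValueError(f"{job.name} is not running")
        job.state = "done"
        # Hand the freed seat to the most important waiting job; on a tie,
        # a paused job is resumed before a never-started one.
        waiting = [j for j in self.jobs if j.state in ("paused", "pending")]
        if waiting:
            nxt = max(waiting, key=lambda j: (j.priority, j.state == "paused"))
            nxt.state = "running"


if __name__ == "__main__":
    pool = LicensePool(seats=1)
    low = Job("regression-run", priority=1)
    high = Job("urgent-fix", priority=10)

    pool.submit(low)
    pool.submit(high)     # preempts the low-priority job
    print(low.state, high.state)
    pool.finish(high)     # the paused job gets its seat back
    print(low.state, high.state)
```

The design choice that matters here is that preemption changes a job's state to paused instead of discarding it, which is what separates this approach from simply stopping the lower-priority project and rerunning it later.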

In today’s big data environments, priorities need to be set to avoid “wasting” software licenses on lower-priority projects. If someone is consuming application resources and priorities shift, that work can and should be paused so that more important work can take precedence. The result is improved efficiency and productivity.

