Cloud Computing: Pie in the sky?
Infrastructure offers potentially big changes
Cloud computing is an infrastructure paradigm that moves computation off-site, delivering it as a Web-based service. The idea is simplicity itself — institutions contract with Internet vendors for computational resources instead of providing those resources themselves through the purchase and maintenance of computational clusters and supercomputer hardware. As with any computational platform — especially one used for high performance and scientific computing — performance limitations within the platform define which computational problems will run well on it. So, the question becomes, “Is cloud computing a useful computational platform for the computational problems you and your institution need to address?”
It is important to note that cloud computing is considered distinct from grid or utility computing. With cloud computing, users effectively rent time on virtual machines to run their own applications — they do not own any of the computing resources, and they need not use any applications supplied by the service provider. Through the use of pre-configured virtual machines, scientists avoid the technical issues associated with getting an application to run — they just use it. This helps to avoid the pointed joke that defines a computer as “a device that turns excellent scientists into mediocre programmers and systems administrators.”
Latency
Since cloud computing is Internet-based, the first and most obvious question to ask is whether the Internet connection utilized by the cloud computing service provides sufficient capability to handle the demands of your applications. Specifically, does it add too much latency or limit bandwidth? (To better understand how to compare your existing HPC platforms with a cloud computing option, I highly recommend reading my February 2007 Scientific Computing column, “HPC Balance and Common Sense,” which you can find in the archive section of the Scientific Computing Web site.)
The impact of communication latency within the cloud is of paramount importance, as it affects the scalability and time-to-solution of many scientific applications. There is a reason people frequently spend large sums of money on high-end communications fabrics for scientific clusters and supercomputers — these networks have low latency. Unfortunately, cloud computing depends heavily on the Internet communications infrastructure, or the network within a virtualized datacenter, to link all the computational nodes (along with the user) to the cloud. While the investment in the Internet’s communications infrastructure is clearly substantial, the Internet was designed as a latency-tolerant network. Similarly, the network within a virtualized datacenter may not be designed to meet the needs of scientific and HPC computing.
As users of the Internet, we all know how the vagaries of Internet access can introduce delays and frustration into our lives while surfing, e-mailing or utilizing some other Internet-dependent application. Humans interact via the Internet on time scales measured in seconds. In contrast, many HPC algorithms and applications perform and scale (or bottleneck and die from a performance perspective) based on communications links with delays measured in just a few microseconds (millionths of a second). Such a disparity in time-scales can result in orders of magnitude increases in time-to-solution when running tightly coupled computations in a cloud environment.
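Measuring that disparity is straightforward. Below is a minimal sketch of the classic ping-pong latency microbenchmark, assuming the mpi4py bindings and an MPI runtime are available (the script name is hypothetical):

```python
# Minimal MPI ping-pong latency microbenchmark (a sketch, not a
# production tool). Assumes mpi4py and an MPI runtime are installed.
# Run with: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = np.zeros(1, dtype='b')   # 1-byte payload to isolate latency
iters = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    elif rank == 1:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
t1 = MPI.Wtime()

if rank == 0:
    # Each iteration is one round trip; half of that is one-way latency.
    print(f"one-way latency: {(t1 - t0) / iters / 2 * 1e6:.2f} microseconds")
```

On a dedicated low-latency fabric, this kind of test typically reports single-digit microseconds; over commodity Ethernet, virtualized networks or wide-area links, the same test can report hundreds or thousands of microseconds.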
HPC communications infrastructures also provide dedicated, high-bandwidth links with quality of service (QoS) capabilities measured in gigabits per second. Evan Felix, David Brown and I have been performing distributed file-system tests on Pacific Northwest National Laboratory’s Chinook supercomputer, which has demonstrated sustained data rates of over 250 GB/s, or 1/4 TB/s. (Note: delivering a terabyte per second requires over 8,000 individual gigabit links.) Currently, even high-end Internet-connected HPC sites with multiple 10 gigabit-per-second (10 Gb/s) links struggle to move a terabyte of data across the Internet overnight with applications such as bbcp, which raises concern about running data-intensive applications in the cloud.
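The arithmetic behind these figures is easy to verify. A quick sketch, using only the rates quoted above:

```python
# Back-of-the-envelope arithmetic for the bandwidth figures quoted above.
TB_in_bits = 8e12            # 1 terabyte expressed in bits
gigabit = 1e9                # one gigabit link, in bits per second

# Links needed to sustain 1 TB/s, ignoring protocol overhead:
print(f"gigabit links for 1 TB/s: {TB_in_bits / gigabit:,.0f}")  # 8,000

# Ideal time to move 1 TB over a single 10 Gb/s link:
seconds = TB_in_bits / 10e9
print(f"1 TB over one 10 Gb/s link: {seconds / 60:.1f} minutes at line rate")
```

At ideal line rate, one terabyte over a single 10 Gb/s link would take roughly 13 minutes; that real transfers take all night shows how far sustained wide-area throughput can fall below the nominal link speed.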
Other performance considerations
In addition, a multitude of other issues can affect performance in a cloud computing environment. Many (but not all) of these issues stem from the difference between the uncontrolled nature of the World Wide Web or a virtualized datacenter and the controlled environment of a traditional computational cluster or supercomputer center.
Virtual machines, for example, can introduce additional latency through the time-sharing nature of the underlying hardware. Since the service provider — and not the scientist — controls the hardware, unanticipated sharing and reallocation of machines can significantly affect runtimes.
Depending on the capability of the underlying hardware, virtual machines can require all data transfers to go through the hardware CPU, which can limit disk and network bandwidth, as well as slow processor-bound applications. (For more information on virtual machines, please see my 2007 column in the Scientific Computing archive, “The Cure for HPC Neurosis: Multiple, Virtual Personalities!”)
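A simple way to quantify that overhead is to run the same input/output test inside the virtual machine and on comparable bare-metal hardware, then compare the results. A crude sketch follows; the file name and sizes are arbitrary, and a serious benchmark would use direct I/O and much larger transfers to defeat caching:

```python
# Crude effective-disk-bandwidth check, useful for comparing a virtual
# machine against bare metal. A sketch only: page-cache effects mean a
# real benchmark should use direct I/O and much larger sizes.
import os, time

path = "bw_test.bin"          # hypothetical scratch file
size = 256 * 1024 * 1024      # 256 MB total
block = b"\0" * (4 * 1024 * 1024)

t0 = time.time()
with open(path, "wb") as f:
    for _ in range(size // len(block)):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())      # force data to disk before timing stops
elapsed = time.time() - t0

print(f"write bandwidth: {size / elapsed / 1e6:.1f} MB/s")
os.remove(path)
```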
Happily, users may never observe a machine failure when running in a cloud environment. Depending on the cloud computing service provider, snapshots can be taken of the virtual machines as they run jobs. If the hardware fails for one virtual machine, then all the virtual machines can be restarted from the last snapshot. I have long advocated such an approach for HPC supercomputers. My guess is that even the largest supercomputing centers will adopt this model in the next five to 10 years to preserve application uptime.
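The same snapshot idea can be applied at the application level. The sketch below illustrates periodic checkpoint/restart for a toy computation; it is an analogue of VM-level snapshots, not a description of how any particular provider implements them, and the file name and interval are arbitrary:

```python
# Minimal application-level checkpoint/restart sketch. An analogue of
# VM-level snapshots; names and the snapshot interval are illustrative.
import os, pickle

CKPT = "state.ckpt"

def load_state():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)          # resume from the last snapshot
    return {"step": 0, "total": 0.0}       # fresh start

def save_state(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)                  # atomic rename: never a torn file

state = load_state()
for step in range(state["step"], 1000000):
    state["total"] += step * 1e-6          # stand-in for real computation
    state["step"] = step + 1
    if state["step"] % 10000 == 0:
        save_state(state)                  # snapshot every 10,000 steps
```

If the machine dies mid-run, restarting the program resumes from the most recent snapshot rather than from step zero.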
Finally, while it escapes me why someone might want to hack the results of a scientific computation running across the Internet, experience has shown that, unfortunately, there are people out there who like doing those kinds of things. Thus, security is a concern.
Despite the potential limitations and concerns, there is clearly a cloud computing bandwagon. So, why is cloud computing generating such interest?
Appeal of cloud
One strong appeal of cloud computing is the apparent simplicity of contracting with an Internet provider for access to a tremendous amount of unused (or underused) computational resources. The result is an astounding “bang for the buck” computational platform that requires very little capital expenditure beyond a high-bandwidth link to the Internet. Since most institutions already provide “big Internet pipes,” cloud computing provides a way to use (and further justify) the investment in that resource.
This can be a boon for scientists and research departments that wish to generate preliminary results before submitting requests in the highly competitive process to get time on large supercomputers or to gain funding for local computational resources. It will be interesting to see how the economics of cloud computing unfold.
Amazon Elastic Compute Cloud
The Amazon Elastic Compute Cloud (EC2) is one example of a cloud computing service that is available to the scientific community. Essentially, Amazon allows users to specify virtual machines that can run on unused hardware resources within their datacenters as those resources become available, so that users only pay for those resources they actually use. Amazon, in turn, is able to generate revenue from unused hardware that must be kept online to address unexpected peak loads.
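The pay-per-use model makes rent-versus-own a straightforward calculation once utilization is known. The sketch below uses entirely hypothetical numbers; substitute current provider rates and your own hardware costs:

```python
# Hypothetical rent-versus-own break-even sketch; none of these numbers
# come from any provider's actual pricing -- plug in current rates.
hourly_rate = 0.10              # $/instance-hour (hypothetical)
instances = 64
hours_per_month = 200           # hours actually used, not wall-clock

cloud_monthly = hourly_rate * instances * hours_per_month

cluster_capex = 150_000.0       # hypothetical hardware purchase cost
cluster_life_months = 36        # amortization period
cluster_opex_monthly = 2_000.0  # power, cooling, admin (hypothetical)
cluster_monthly = cluster_capex / cluster_life_months + cluster_opex_monthly

print(f"cloud:   ${cloud_monthly:,.0f}/month")
print(f"cluster: ${cluster_monthly:,.0f}/month")
```

The crossover depends largely on utilization: steady, near-full use favors owned hardware, while bursty or exploratory workloads favor renting.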
Researchers also report that the network within the EC2 virtualized datacenter appears to meet their application requirements. For example, J.J. Rehr, J.P. Gardner, M. Prange, L. Svec and F. Vila at the University of Washington compared their department’s computational cluster with the Amazon EC2 running the x-ray spectroscopy and electronic structure code FEFF. They reported in their paper “Scientific Computing in the Cloud” (arXiv:0901.0029) that EC2 performed and scaled comparably to their AMD Opteron cluster.
From this paper, we can infer that the EC2 datacenter network and the associated virtual machine overhead provide a sufficiently balanced platform for running the FEFF application. Further, the authors created pre-configured virtual machines so other scientists can benefit from their work and run FEFF, as the authors noted, “without the need to purchase and maintain or even understand HPC hardware.” Please contact the authors for more information if you would like to run FEFF in a cloud environment.
Papers like this represent solid investigations into the potential of cloud computing to address the cross-spectrum needs of the scientific community. Essentially, scientists who wish simply to run a simulation and analyze the results can do so without making any significant investment in computer hardware, or in the time needed to get the code to run.
View to the future
The University of Washington eScience Institute continues to investigate the potential of cloud computing through efforts such as their upcoming workshop “Research Computing in the Commercial Cloud.” Please see the University of Washington eScience Institute Web site for more information (escience.washington.edu).
Thus far, we have only discussed cloud computing based on the commonly used message passing interface (MPI) programming model. Without doubt, newer programming frameworks, such as MapReduce (see Apache Hadoop for a freely downloadable implementation) and other programming patterns motivated by Google’s work on cloud computing, will have an effect.
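To give a flavor of the pattern: a MapReduce computation supplies a map function that emits key/value pairs and a reduce function that combines all values sharing a key, while the framework handles distribution. A minimal single-process sketch of the canonical word-count example:

```python
# Single-process sketch of the MapReduce pattern; a framework such as
# Hadoop would run map() and reduce() on many machines in parallel.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield (word.lower(), 1)            # emit (key, value) pairs

def reduce_fn(word, counts):
    return (word, sum(counts))             # combine all values for a key

documents = ["the cloud", "the grid and the cloud"]

# "Shuffle" phase: group mapped values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

results = [reduce_fn(k, v) for k, v in groups.items()]
print(sorted(results))  # [('and', 1), ('cloud', 2), ('grid', 1), ('the', 3)]
```

Because each map and reduce call is independent, the framework can spread them across unreliable commodity machines and simply re-run any piece that fails.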
Looking to the future, my bet is that programming models evolving from NVIDIA’s massively threaded CUDA (Compute Unified Device Architecture) and other efforts, such as OpenCL, have the potential to change the face of cloud computing because they exploit massive multithreading to hide latency.
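The underlying principle is that when one thread stalls on a long-latency operation, another runs in its place; with enough threads in flight, the stalls overlap rather than accumulate. The sketch below illustrates the concept with ordinary threads and a simulated 10-millisecond latency (it is not CUDA code, just the idea):

```python
# Conceptual latency-hiding demo: overlap many high-latency operations
# so total time approaches one latency, not the sum of all of them.
# (Plain threads standing in for the massive multithreading of a GPU.)
import time
from concurrent.futures import ThreadPoolExecutor

def slow_op(i):
    time.sleep(0.01)   # simulated 10 ms latency (e.g., a remote fetch)
    return i * i

t0 = time.time()
serial = [slow_op(i) for i in range(100)]             # latencies add up
t_serial = time.time() - t0

t0 = time.time()
with ThreadPoolExecutor(max_workers=100) as pool:
    overlapped = list(pool.map(slow_op, range(100)))  # latencies overlap
t_overlap = time.time() - t0

print(f"serial: {t_serial:.2f}s  overlapped: {t_overlap:.2f}s")
# Typically about 1.0s serial versus a few hundredths of a second overlapped.
```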
The future looks bright, although there are some big clouds out there! Happy “cloud” computing.
Rob Farber is a senior PNNL research scientist working with the William R. Wiley Environmental Molecular Sciences Laboratory, a Department of Energy national scientific user facility located in Richland, Washington. He may be reached at editor@ScientificComputing.com.