Grid computing unites scientists around the world and uses their collective computing power to investigate science’s unanswered questions.
The Large Hadron Collider (LHC), being built by CERN, the European Organization for Nuclear Research, near Geneva, Switzerland, is the largest scientific instrument on the planet. It is designed to accelerate protons to nearly the speed of light and collide them in the search for answers to some of science’s unanswered questions, such as the origin of mass. When it comes online in 2007, the LHC will be able to “see” up to 40 million collision events per second, enabling the detectors of its main experiments, ALICE (A Large Ion Collider Experiment), ATLAS (A Toroidal LHC ApparatuS), CMS (Compact Muon Solenoid), and LHCb (Large Hadron Collider beauty experiment), to watch as the energy of these collisions mimics the conditions that existed a fraction of a second after the Big Bang.
All this capability will spell a new era for particle physicists and, for that matter, science around the world. But there are real challenges. In operation, the LHC will produce roughly 15 petabytes of data annually, the equivalent of about 3 million DVDs or 100,000 times the storage capacity of the average desktop computer. Thousands of scientists around the world will need to access and analyze this data to find elusive evidence of new particles and forces.
No single institution could easily store all of the data produced by the LHC in one place and also provide enough computing power to support the scientists who will need daily access to the data. To deal with the vast amounts of data and the accessibility issues, the CERN scientists have turned to grid computing. Grids are the most recent step in tapping into the power of distributed computing and storage resources across the world.
Life before the grid
Initially, scientists used clusters of computers to overcome the lack of computational power. In a cluster, an approach first explored in the early 1980s, groups of coupled computers work together to solve complex problems that could not be solved by one machine alone. Computer clusters are still used in supercomputer centers, research labs, and industry to provide significant improvements in total computing power.
The next step in increasing computational power was distributed computing, a form of parallel computing in which the computers used to complete the tasks are in multiple geographic locations. Applications are distributed between two or more computers over a network to accomplish tasks that are too complex for one computer alone. One example of distributed computing is SETI@home. This program uses idle CPU time on Internet-connected computers to crunch data from radio telescopes in the Search for Extraterrestrial Intelligence (SETI). By using distributed computers, the program creates a powerful computing system with global reach and supercomputer capabilities. Indeed, distributed computing was the first real step toward today’s computing grids.
Enter the grid
In March 2005, the LCG project surpassed 100 sites in 31 countries, which made it the world’s largest scientific grid. Photo: CERN
The term “grid” arose in the late 1990s to describe a computing infrastructure that works like a power grid. Users would be able to access computing resources as needed without worrying about where they came from, much as a person accesses the electric grid. The “power stations” on the computing grid are clusters of computers, and the “power lines” are the fiber optics of the Internet.
In 1995, Ian Foster of Argonne National Laboratory and the Univ. of Chicago, Ill., and Carl Kesselman of the Information Sciences Institute at the Univ. of Southern California, Los Angeles, known as the fathers of grid computing, developed I-WAY, the first true grid computer. Foster, R&D Magazine’s 2003 Innovator of the Year, and Kesselman looked at ways of using network technology to build very large, powerful systems. Their approach was to get machines in different locations to work on parts of a problem and then combine the results, rather than to write software to run on multiple processors in parallel. Ultimately, these ideas came together in I-WAY, which used high-speed networks to connect end resources at 17 sites across North America, marking the start of grid computing.
In the summer of 2000, Kesselman went to Geneva to give a seminar on grid computing, and the LHC Computing Grid (LCG) was born. A grid was chosen for the LHC because the significant costs of maintaining and upgrading the necessary resources are more easily handled in a distributed environment. In this way, individual institutes and national organizations could fund local computing resources and retain responsibility for them, while at the same time still contributing to the global goal.
World’s largest international scientific grid
When the LHC is running optimally, access to experimental data will need to be provided for more than 5,000 scientists in 500 research institutes and universities worldwide that are participating in LHC experiments. In addition, this data needs to be available over the 15-year estimated lifetime of the LHC.
Grids, grids everywhere
The Large Hadron Collider Computing Grid collaborates with many other major grid development projects and production environments around the world.
EGEE: The Enabling Grids for E-sciencE (EGEE) project brings together scientists and engineers from more than 90 institutions in 32 countries. The project was conceived from the start as a four-year effort, and its second two-year phase began on April 1, 2006. Funded by the European Commission, EGEE is a major contributor to the operations of the LCG project.
GridPP: GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. Currently, this grid includes 17 UK institutions. When the LHC opens in 2007, GridPP will help process the accompanying data deluge by contributing the equivalent of 10,000 PCs to this worldwide effort.
INFN Grid: The INFN Grid project is used by INFN, Italy’s National Institute for Nuclear Physics, to develop and deploy grid middleware services. The INFN Grid provides, deploys, and operates an open source release, essentially based on EGEE gLite middleware, tailored to the needs of the Italian grid infrastructure and user communities.
NorduGrid: NorduGrid develops and deploys a set of tools and services, the Advanced Resource Connector (ARC) middleware, which is free software. The core of the collaboration historically consists of several Nordic academic and research institutes. NorduGrid interoperates with the LCG.
Grid3: Grid3 is operated jointly by the U.S. Grid projects iVDGL, GriPhyN, and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS. Project highlights include participation by more than 25 sites across the U.S. and Korea, which collectively provide more than 2,000 CPUs.
OSG: The Open Science Grid (OSG) is a distributed computing infrastructure for large-scale scientific research, built and operated by a consortium of U.S. universities, national laboratories, scientific collaborations, and software developers. The OSG integrates computing and storage resources from more than 50 sites in the U.S., Asia, and South America.
All of these requirements led to the creation of the LCG, the mission of which is to build and maintain a data storage and analysis infrastructure for the entire high-energy physics community that will use the LHC. The LCG is a worldwide network of thousands of PCs, organized into large clusters and linked by ultra-high speed connections to create the world’s largest international scientific computing grid. Among the LCG’s goals are:
• Developing different software components to support the physics application software in a Grid environment.
• Developing and deploying computing services based on a distributed Grid model.
• Managing users and their rights in an international, heterogeneous, and non-centralized Grid environment.
• Managing acquisition, installation, and capacity planning for the large number of commodity hardware components that form the physical platform for the LCG.
In addition to linking with individual PCs worldwide, the LCG collaborates with many existing science grid infrastructures, among them the E.U.-funded Enabling Grids for E-sciencE (EGEE) project and the U.S. Open Science Grid (OSG) project (see sidebar). At the EGEE’06 conference in Geneva in September, CERN Director General Robert Aymar emphasized the importance of such grids to the LCG. “We are just over one year away from the anticipated launch of the Large Hadron Collider. We expect this device will open up new horizons in particle physics,” says Aymar. “The EGEE infrastructure is a key element in making the LHC Computing Grid possible, and thus the success of the LHC is linked to the success of the EGEE project.”
In terms of deliverables, the LCG is already being tested by ALICE, ATLAS, CMS, and LHCb to simulate the computing conditions expected once the LHC goes online. As a result, LCG partners are achieving record-breaking results for high-speed data transfers, distributed processing, and storage. For example, in 2005, eight major computing centers completed a challenge to sustain a continuous flow of 600 MB/sec on average for 10 days from CERN to seven sites in Europe and the U.S. This exercise was part of a service challenge designed to test the infrastructure of the LCG. The total amount of data transmitted in the challenge—500 TB—would take about 250 years to download using a typical 512 kb/sec household broadband connection.
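The figures in this example can be checked with a little arithmetic. The short Python sketch below is only an illustration: the function names are made up for this purpose, and it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes, 1 kb = 1,000 bits) and a 365.25-day year. Under those assumptions, 600 MB/sec sustained for 10 days comes to a little over 500 TB, and downloading 500 TB at 512 kb/sec takes close to 250 years, in line with the numbers quoted above.

```python
# Back-of-the-envelope check of the service-challenge figures.
# Assumptions (not from the article): decimal units and a 365.25-day year.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def total_bytes(rate_mb_per_s: float, days: float) -> float:
    """Bytes moved at a sustained rate (in MB/sec) over a number of days."""
    return rate_mb_per_s * 1e6 * days * SECONDS_PER_DAY


def download_years(size_bytes: float, link_kbit_per_s: float) -> float:
    """Years needed to download size_bytes over a link rated in kb/sec."""
    seconds = (size_bytes * 8) / (link_kbit_per_s * 1e3)
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    moved = total_bytes(600, 10)
    print(f"Data moved in the challenge: {moved / 1e12:.1f} TB")
    print(f"Years to download 500 TB at 512 kb/sec: {download_years(500e12, 512):.1f}")
```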
Vicky White, head of the Fermilab Computing Division, Batavia, Ill., one of the challenge participants, commented, “High-energy physicists have been transmitting large amounts of data around the world for years, but this has usually been in relatively brief bursts and between two sites. Sustaining such high rates of data for days on-end to multiple sites is a breakthrough, and augurs well for achieving the ultimate goals for grid computing.”
However, even with all of these successes, the developers of the LCG are still dealing with some challenges. These include ensuring adequate levels of network bandwidth between the contributing resources, maintaining coherence of software versions installed in various locations, coping with heterogeneous hardware, managing and protecting the data, and providing accounting mechanisms so that different groups have fair access, based on their needs and contributions to the infrastructure. Other challenges include how to balance local ownership of resources with making them available to the larger community and how to overcome local security worries about giving access to “anonymous” non-local users.
The brain of LCG
Linking thousands of computers together into one grid requires the use of standard protocols and services. As such, the brain of any computer grid is its middleware, which makes the grid’s many different networks and resources look seamless to the user and allows the user to submit a job to the entire grid. The middleware draws on resource brokers, replica managers, and information services to determine where best to run each job. It then copies or moves the files as necessary and returns the results to the user, without the user needing to know where the results came from. Security is paramount in such a system. Without authorization, authentication, and accounting, there is no grid.
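To make the idea of brokering concrete, here is a minimal Python sketch of how a resource broker might pick a site for a job. It is a toy model, not how any real grid middleware is implemented: the Site and Job classes, the choose_site function, and the selection rule (prefer a site that already holds a replica of the data, then the site with the most free CPUs) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class Site:
    """Hypothetical view of one grid site, as an information service might report it."""
    name: str
    free_cpus: int
    datasets: Set[str] = field(default_factory=set)  # replicas stored at this site


@dataclass
class Job:
    """Hypothetical job description submitted by a user."""
    user: str
    dataset: str
    cpus_needed: int


def choose_site(job: Job, sites: list, authorized_users: Set[str]) -> Optional[Tuple[Site, bool]]:
    """Return (site, needs_copy) for the job, or None if no site has enough free CPUs.

    needs_copy is True when the chosen site holds no replica of the dataset,
    so the files would have to be copied there before the job runs.
    """
    # Authorization comes first: unknown users never reach the broker.
    if job.user not in authorized_users:
        raise PermissionError(f"user {job.user!r} is not authorized")

    candidates = [s for s in sites if s.free_cpus >= job.cpus_needed]
    if not candidates:
        return None

    # Prefer sites that already hold the data, then the most free CPUs.
    best = max(candidates, key=lambda s: (job.dataset in s.datasets, s.free_cpus))
    return best, job.dataset not in best.datasets


if __name__ == "__main__":
    sites = [Site("site_a", 100, {"run1"}), Site("site_b", 500)]
    site, needs_copy = choose_site(Job("alice", "run1", 50), sites, {"alice"})
    print(site.name, "copy needed" if needs_copy else "data already local")
```

A real broker would weigh many more factors, such as queue lengths, network bandwidth, and usage accounting, but the shape of the decision is the same: check who is asking, find out what is available, and place the job where the data and the CPUs are.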
The middleware chosen for the LCG is the Globus Toolkit, which won a Special R&D 100 Award in 2002 for the Most Promising New Technology. Led by Kesselman and Foster, Globus is an open source project that grew out of the grid community’s attempts to solve real problems that are encountered by real application projects. It provides many of the basic services needed to construct grid applications such as security, resource discovery, resource management, and data access. Globus enables the LCG to interpret a user’s request and then autonomously find the appropriate computing resources. It then breaks the job into smaller tasks, allocates the computing power, and starts solving the problem.
Rivers of data
To be processed, the massive amounts of data from the LHC will need to be distributed worldwide. A four-tiered model was chosen for the data distribution. The Tier-0 center of the LCG at CERN will handle data acquisition and initial processing of the data. In addition, all data will be recorded on a primary backup tape kept at CERN. After initial processing, the Tier-0 center will distribute the data to a series of Tier-1 centers, large computer centers with sufficient storage capacity for the data and with around-the-clock support for the grid. The 11 large Tier-1 centers will carry out the data-heavy analysis and will then make the data available to the 100 Tier-2 centers in 40 countries.
Each Tier-2 center consists of one or several collaborating computing facilities that can store sufficient data and provide adequate computing power for specific analysis tasks. The Tier-2 centers will simulate the details of the experiments and support the various analysis efforts of groups or individuals. Individual scientists from around the world will access the Tier-2 facilities through Tier-3 computing resources, which can consist of local clusters in a university department or even individual PCs, and which may be allocated to LCG on a regular basis.
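The tiered model can be summarized as a small data structure. The Python sketch below is an illustration only: the Tier class and the helper functions are hypothetical, the site counts are the ones given in this article (the number of Tier-3 resources is not stated, so it is left unset), and the strictly top-down flow of data is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    level: int
    role: str
    sites: Optional[int]  # number of centers as given in the article; None if not stated


TIERS = [
    Tier(0, "data acquisition, initial processing, and tape backup at CERN", 1),
    Tier(1, "large centers for storage and data-heavy analysis", 11),
    Tier(2, "simulation and analysis for groups or individuals", 100),
    Tier(3, "local clusters or individual PCs used by scientists", None),
]


def path_to(level: int) -> List[int]:
    """Tier levels that data pass through on the way from Tier-0 to the given tier."""
    if not 0 <= level < len(TIERS):
        raise ValueError(f"unknown tier level: {level}")
    return [t.level for t in TIERS[: level + 1]]


def average_fanout(parent: int, child: int) -> float:
    """Average number of child-tier centers per parent-tier center."""
    parent_sites, child_sites = TIERS[parent].sites, TIERS[child].sites
    if parent_sites is None or child_sites is None:
        raise ValueError("site count not stated for one of the tiers")
    return child_sites / parent_sites


if __name__ == "__main__":
    for tier in TIERS:
        print(f"Tier-{tier.level}: {tier.role}")
    print("Path to Tier-2:", path_to(2))
    print(f"Tier-2 centers per Tier-1 center, on average: {average_fanout(1, 2):.1f}")
```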
Looking to the future
As the date for the LHC’s start approaches, the scale of the LCG in terms of the number of sites is already close to its target of 50,000 PCs. Computational and storage capacity is also beginning to ramp up, but work remains to be done on improving the overall reliability of the LCG. The LCG will continue to grow over the next year by adding sites and increasing resources available at existing sites. In addition, the exponential increase in processor speed and disk storage capacity inherent to the IT industry will help achieve the LCG’s ambitious computing goals. Once the LHC goes online, the LCG will unite scientists around the world in searching for the answers to some of science’s most intriguing questions.