Modifying Google’s MapReduce to Increase GPU Cluster Computing
Adding three new steps allows GPU and MapReduce to work together
This volume-rendered image of a supernova, generated from a volume dataset of size 1024^3, took approximately one second with eight GPUs, achieving approximately 90 percent parallel efficiency (speedup divided by the number of processes). Courtesy of Jeff A. Stuart
A modified version of MapReduce — Google’s patented program for distributed and cluster computing — harnesses the power of graphics processing units (GPUs) for large-scale, high-performance applications, claim University of California, Davis computer science researchers. In benchmark performance tests, GPMapReduce increased both speed and efficiency on a GPU cluster, explained UC Davis graduate student Jeff Stuart, who with electrical and computer engineering professor John Owens developed the new approach.
Their work continues a CPU-to-GPU data-processing trend that started earlier this decade.
“The performance of single-core CPUs has stagnated, because we haven’t had a significant boost in single-core processor speeds in years,” Stuart explains. “At the same time, graphics processor programmability and performance have increased dramatically, with order-of-magnitude gains across a broad variety of applications.”
Those applications include cloud computing hosted on massive server farms. Amazon’s Elastic Compute Cloud (EC2) service; AMD’s Fusion CPU-GPU hybrids; Intel’s Sandy Bridge and Ivy Bridge projects; and NVIDIA’s Project Denver all utilize GPU power for next-generation cloud applications.
“GPUs particularly excel at the throughput-oriented workloads characteristic of scientific computation and large-scale server applications,” Owens says.
Trend reversal
Despite their high-performance promise and increasing use, GPU clusters have generally been neglected by programmers, Stuart says. As a result, “GPUs have yet to effectively tackle problems of scale.”
Instead, in an ironic reversal of the trend away from the single CPU, CPU clusters have remained ahead of their GPU counterparts. “We tend to implement large-scale problems on CPU clusters and develop powerful toolsets for multiple CPUs,” Stuart explains.
Even more ironic: one of the toolsets keeping CPU clusters in the lead is MapReduce itself. Relatively easy-to-use, MapReduce “focuses on the problem, rather than the mundane details typical of distributed programming,” such as memory allocation and communication between software and devices, Stuart explains.
In a two-step process that mirrors MapReduce’s name, a master node “maps” the input problem, breaking it into smaller sub-problems that individual worker nodes solve. MapReduce then combines the worker node sub-solutions, “reducing” them to a single solution.
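For readers unfamiliar with the pattern, the two-step flow is easiest to see in code. Below is a minimal, single-process sketch for the classic word-count task; the function names are illustrative only and belong to neither Google’s MapReduce nor GPMR.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map phase: a mapper turns its piece of the input into intermediate
// {key, value} pairs -- here, {word, 1} for every word it sees.
std::vector<std::pair<std::string, int>> mapWords(const std::string& text) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(text);
    std::string word;
    while (in >> word) pairs.emplace_back(word, 1);
    return pairs;
}

// Reduce phase: pairs sharing a key are combined into one result --
// here, a total count per word.
std::map<std::string, int> reduceCounts(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}

int main() {
    // In a real cluster, many mappers run in parallel on different pieces
    // of the input; a single call stands in for all of them here.
    auto pairs = mapWords("the quick brown fox jumps over the lazy dog");
    for (const auto& kv : reduceCounts(pairs))
        std::cout << kv.first << ": " << kv.second << "\n";
    return 0;
}
```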
Highly scalable, MapReduce can process one petabyte (one quadrillion bytes) of data on a CPU cluster in just hours. Adapting the Google innovation for GPU clusters seems a logical next step.
“Typical MapReduce tasks are easily modified to fit into GPMapReduce and leverage a GPU cluster,” the UC Davis team writes in a recent paper on their effort.
FFT on a GPU
When they introduced a method to perform fast Fourier transforms (FFTs), signal-processing algorithms central to image analysis, on graphics cards, Sandia National Laboratories computer scientist Ken Moreland and University of New Mexico computer scientist Edward Angel helped to kick-start a GPU revolution.
MapReduce work flow: GPU executes dashed-outline stages; CPU executes solid-outline stages. Dashed line transition between Bin and Sort stage denotes that GPMR transfers data to another processor, potentially over the network. GPMR individually streams chunks to GPUs and executes a Mapper on each. If supplied, GPMR executes a Partial Reduction and then, if no Combiner is supplied, a partition. GPMR copies key-value sets to CPU and transmits them over the network while the scheduler loops over a new chunk. If the user supplies a Combiner, GPMR stores all key-value pairs in CPU memory (due to size limitations of GPU memory) until all Maps complete, then executes the Combiner on each unique key before partitioning key-value pairs. Once each process receives all of its pairs, GPMR executes Sort, followed by Reduce.
Aptly entitled “The FFT on a GPU,” their 2003 SIGGRAPH Graphics Hardware Proceedings paper mapped a new route around what they termed “the serious bottleneck of another processing unit, such as the CPU.”
Moving memory-intensive graphics data to and from the graphics processor was holding back high-performance applications. But “amazingly cheap, powerful and flexible” graphics cards could change all that by running the applications directly, Moreland and Angel reasoned.
To demonstrate, they built a GPU application that rendered a colorful, three-dimensional image of a teapot using a fast Fourier transform, but without the need to transfer data between CPU and GPU memory. The results convinced them that graphics processors had an as-yet-unrealized role in high performance computing.
The natural next step, however — GPU clustering — has yet to gain significant traction, eight years later.
Cluster of challenges
Although GPU clustering “is certainly on the radar of forward-thinking companies such as Facebook and Twitter,” the concept presents several challenges, including difficult programming, John Owens explains.
The most popular GPU interfaces should take some of the headaches out of programming. But NVIDIA’s Compute Unified Device Architecture, or CUDA, is proprietary, which “concerns buyers,” Owens says. And its open-standard counterpart, OpenCL, has performance problems.
“While none of these obstacles are insurmountable, taken together, they might make a potential buyer look elsewhere,” he explains.
To surmount the obstacle that started it all — performing fast Fourier transforms — computer science professor Yifeng Chen, Ph.D., and colleagues at Peking University in Beijing, China, have built PKUFFT, a different FFT algorithm designed to run on GPU clusters.
Chen says he is less inclined to blame poor programmability for the unusual lag in GPU cluster technology, instead looking to the state of technology itself.
“Hardware and computer architecture of every sort have been changing way too fast in the last five years, but this situation will improve,” he explains. “Programming a GPU cluster will always be more challenging than programming a CPU cluster, but the complexity will eventually become manageable.”
Chunks of inputs
To manage GPU cluster programming, GPMapReduce (GPMR) adds three new steps — Chunking, Partial Reduction and Accumulation — to Map and Reduce.
Graphics processors and MapReduce work at odds with one another. The GPU works on hundreds, thousands, even millions of items concurrently, while each MapReduce “mapper” works on one input item at a time. To get GPU and MapReduce to work together, the UC Davis team introduced chunking — grouping single inputs into data chunks.
Instead of one item at a time, GPMapReduce works on chunks of items that better utilize GPU capacity. “GPMR is the first GPU MapReduce library to give this power to the user,” Owens explains.
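As a rough, single-threaded illustration of the idea (in GPMR the per-element work runs as CUDA kernels, and the chunk size and function names below are invented), chunking simply means each mapper call receives an array of inputs rather than a single item:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// One mapper invocation receives a whole chunk of inputs. On a GPU, the
// loop below would instead run as thousands of concurrent threads, each
// handling one or more elements of the chunk.
void mapChunk(const std::vector<int>& chunk, std::vector<int>& out) {
    out.resize(chunk.size());
    for (std::size_t i = 0; i < chunk.size(); ++i)
        out[i] = chunk[i] * chunk[i];   // stand-in per-element computation
}

int main() {
    const std::size_t kChunkSize = 4;   // real chunks hold many thousands of items
    std::vector<int> input = {1, 2, 3, 4, 5, 6, 7, 8, 9};

    // The scheduler streams the input to mappers one chunk at a time.
    for (std::size_t start = 0; start < input.size(); start += kChunkSize) {
        std::size_t end = std::min(start + kChunkSize, input.size());
        std::vector<int> chunk(input.begin() + start, input.begin() + end);
        std::vector<int> results;
        mapChunk(chunk, results);
        for (int r : results) std::cout << r << ' ';
        std::cout << '\n';
    }
    return 0;
}
```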
Partial Reduction adds an optional stage to the Map phase, grouping key-value pairs (MapReduce’s standard data representation) that share the same key. “This step helps reduce the size of data transfers, lowering the execution time of a MapReduce task,” explains Stuart.
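A minimal sketch of that idea, assuming a simple counting workload and illustrative function names (this is not GPMR’s API), shows how pre-combining pairs that share a key shrinks the data that must leave the device:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Partial reduction (sketch): before any key-value pairs leave the GPU,
// pairs that share a key are summed locally, so far fewer pairs have to
// cross the PCI-e bus and the network.
std::vector<std::pair<std::string, int>> partialReduce(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> combined;
    for (const auto& kv : pairs) combined[kv.first] += kv.second;
    return {combined.begin(), combined.end()};
}

int main() {
    // Mapper output for one chunk: many pairs with duplicate keys.
    std::vector<std::pair<std::string, int>> mapped = {
        {"cat", 1}, {"dog", 1}, {"cat", 1}, {"cat", 1}, {"dog", 1}};

    auto reduced = partialReduce(mapped);
    std::cout << "pairs to transfer: " << mapped.size()
              << " -> " << reduced.size() << '\n';   // 5 -> 2
    for (const auto& kv : reduced)
        std::cout << kv.first << ": " << kv.second << '\n';
    return 0;
}
```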
Accumulation is a final icing-on-the-cake step, optimizing GPMR by combining map tasks. “Accumulation can drastically reduce Peripheral Component Interconnect Express (PCI-e) traffic and significantly improve performance of a MapReduce job,” Stuart explains.
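The sketch below illustrates the principle with an invented summation workload: partial results stay resident across chunks (in GPU memory, in GPMR’s case) and are folded into a running accumulator, so one final copy replaces a transfer after every chunk.

```cpp
#include <iostream>
#include <vector>

// Accumulation (sketch): rather than copying results back over PCI-e after
// every chunk, partial results stay resident (in GPU memory, in GPMR) and
// are folded into a running accumulator; only the final total is copied out.
int main() {
    std::vector<std::vector<int>> chunks = {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}};

    long long accumulator = 0;          // would live in GPU memory in GPMR
    for (const auto& chunk : chunks) {
        long long chunkSum = 0;         // "map" over one chunk
        for (int v : chunk) chunkSum += v;
        accumulator += chunkSum;        // accumulate instead of transferring now
    }

    // One small transfer at the end replaces one transfer per chunk.
    std::cout << "total: " << accumulator << '\n';   // prints 45
    return 0;
}
```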
Scaling new heights
To measure what Stuart and Owens call “perhaps the most important trait of any MapReduce application: its scalability,” they tested GPMR at the National Center for Supercomputing Applications in Urbana-Champaign, IL. On an “accelerator cluster” with 32 nodes, each equipped with four NVIDIA GT200 GPUs, the team performed tests on up to 64 GPUs, “a small to medium cluster,” Stuart says.
Written in C++ and CUDA, GPMR ran five benchmarks typical for testing new MapReduce libraries:
- Matrix Multiplication: multiplies two large square matrices
- Sparse Integer Occurrence: counts the number of times each integer appears in a large number set
- Word Occurrence: counts the number of times each word occurs in a text
- Linear Regression: fits a best-fit line to a set of data points (see the sketch after this list)
- K-Means Clustering: groups data points into k clusters around computed centroids
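Linear regression, for example, fits the map-and-reduce pattern because the slope and intercept of a best-fit line depend only on a handful of running sums that mappers can compute independently and a reducer can add together. The single-process sketch below illustrates that decomposition; it is not GPMR’s benchmark code.

```cpp
#include <iostream>
#include <utility>
#include <vector>

// Running sums that fully determine a least-squares line fit.
struct Sums { double n, sx, sy, sxy, sxx; };

// Map: each call computes partial sums over its share of the points.
Sums mapPoints(const std::vector<std::pair<double, double>>& pts) {
    Sums s = {0, 0, 0, 0, 0};
    for (const auto& p : pts) {
        s.n += 1;  s.sx += p.first;  s.sy += p.second;
        s.sxy += p.first * p.second;  s.sxx += p.first * p.first;
    }
    return s;
}

// Reduce: partial sums from different mappers simply add together.
Sums reduceSums(const Sums& a, const Sums& b) {
    return {a.n + b.n, a.sx + b.sx, a.sy + b.sy, a.sxy + b.sxy, a.sxx + b.sxx};
}

int main() {
    // Two "chunks" of points lying on the line y = 2x + 1.
    Sums s = reduceSums(mapPoints({{0, 1}, {1, 3}}),
                        mapPoints({{2, 5}, {3, 7}}));
    double slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx * s.sx);
    double intercept = (s.sy - slope * s.sx) / s.n;
    std::cout << "y = " << slope << "x + " << intercept << '\n';  // y = 2x + 1
    return 0;
}
```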
The results were encouraging, with matrix multiplication scaling best. Though input-output bottlenecks interfered with the other four benchmarks, the team nonetheless observed scalability “previously unattainable with other GPU MapReduce libraries,” Owens says.
Apropos application
One immediate, practical GPMapReduce application is, appropriately, map generation for services such as Google Maps or Microsoft’s Bing Maps, Stuart explains.
GPMR would first generate a set of data tiles defined by points on Earth. Mapper inputs would include features such as roads, parks, lakes, cities, borders and points of interest defined by vectors and metadata.
Inputs in hand, the mapper would output all the tiles each feature intersects, returning a map of {tile, feature} pairs. The reducer would then generate an image of the tiles that includes all the features.
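A toy sketch of such a job might look like the following; the Feature type, tile numbering, and “rendering” by printing are invented for illustration and come from neither GPMR nor either mapping service.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A map feature (road, lake, park, ...) and the IDs of the tiles it crosses.
// In a real system the tile IDs would be derived from the feature's vector
// geometry; here they are supplied directly for brevity.
struct Feature {
    std::string name;
    std::vector<int> tiles;
};

// Map: emit one {tile, feature} pair for every tile the feature intersects.
std::multimap<int, std::string> mapFeatures(const std::vector<Feature>& feats) {
    std::multimap<int, std::string> pairs;
    for (const auto& f : feats)
        for (int t : f.tiles) pairs.insert({t, f.name});
    return pairs;
}

// Reduce: for each tile, "render" an image from all features on that tile.
// Printing stands in for rasterization, the step a GPU would accelerate.
void reduceTiles(const std::multimap<int, std::string>& pairs) {
    for (auto it = pairs.begin(); it != pairs.end();) {
        int tile = it->first;
        std::cout << "tile " << tile << ":";
        for (; it != pairs.end() && it->first == tile; ++it)
            std::cout << ' ' << it->second;
        std::cout << '\n';
    }
}

int main() {
    std::vector<Feature> features = {
        {"highway 50", {1, 2, 3}}, {"lake tahoe", {3, 4}}, {"city park", {2}}};
    reduceTiles(mapFeatures(features));
    return 0;
}
```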
“GPUs would be great at accelerating the reducer, and might even be suitable for the mapper,” Stuart explains.
Sounds like a technology Microsoft and Google should be clamoring for — but not so fast, says Peking University’s Chen.
“I am not expecting to see any dominant programming standard for GPU clusters anytime soon,” Chen explains. The pace of computer architectural change needs to slow long enough for programmers to adopt various standards — and catch up. “Once the right programming technology is standardized, catching up will take only a couple of years,” Chen says.
Mike Martin is a science and technology writer based in Columbia, MO. He may be reached at [email protected].