Dinner-plate-sized chips with trillions of transistors could give traditional GPUs a run for their money

By Brian Buntz | June 20, 2025

Modern graphics processing units (GPUs) pack billions of transistors into a single piece of silicon. This chip density drives advances in protein structure prediction, weather forecasting, autonomous vehicles and the generative AI wave. Yet such extreme density creates cascading challenges, contributing to the power consumption and heat generation problems affecting advanced AI data centers.

A new review paper published by UC Riverside engineers in the journal Device explores an alternative architectural paradigm: the wafer-scale accelerator. The flagship example, developed by Cerebras, is built on a dinner-plate-sized silicon wafer and is more than an order of magnitude larger than a traditional GPU die. While a typical GPU die measures around 800 square millimeters (roughly the size of a postage stamp), the Cerebras Wafer-Scale Engine spans more than 46,000 square millimeters. In essence, it uses an entire silicon wafer as a single chip. Cerebras calls the WSE-3 “the largest AI chip ever built.”

When comparing standard GPU chips against wafer-scale systems like the Cerebras Wafer-Scale Engine 3 (WSE-3), the UC Riverside team found that the wafer-scale design delivers substantially more computing power and better energy efficiency per task than traditional designs. For context, the WSE-3 packs 4 trillion transistors and 900,000 AI-specific cores onto a single wafer.
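
A quick back-of-envelope check puts those figures in perspective. The only number below not taken from this article is the H100 transistor count, which is NVIDIA's publicly stated figure:

```python
# Back-of-envelope arithmetic using the figures quoted above. The
# 80-billion-transistor H100 count is NVIDIA's public figure, not a
# number from the UCR paper.
gpu_die_mm2 = 800          # typical GPU die area (article)
wse3_die_mm2 = 46_000      # Cerebras WSE-3 area (article)
wse3_transistors = 4e12    # WSE-3 transistor count (article)
h100_transistors = 80e9    # NVIDIA's publicly stated H100 count

print(f"Area ratio: ~{wse3_die_mm2 / gpu_die_mm2:.0f}x")  # ~58x

# Transistor density per mm^2 is comparable; the wafer-scale advantage is area.
print(f"WSE-3: ~{wse3_transistors / wse3_die_mm2 / 1e6:.0f}M transistors/mm^2")  # ~87M
print(f"H100:  ~{h100_transistors / gpu_die_mm2 / 1e6:.0f}M transistors/mm^2")   # ~100M
```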

In addition to the WSE-3, the review examined other wafer-scale systems, such as the Cerebras WSE-2 and the Tesla Dojo Training Tile, as well as GPU-class systems, including the NVIDIA H100, the NVIDIA H800, the previous-generation NVIDIA A100, and Google’s Trillium tensor processing unit (TPU v6e). The paper also references the AMD MI300X, which has 153 billion transistors and combines CPU and GPU architectures.

While wafer-scale systems like the WSE-3 have clear advantages, their monolithic design limits scalability beyond a single wafer, and their specialized architecture requires customized software ecosystems. In addition, GPU racks remain dominant in AI model training owing to their flexibility and optimization for existing AI workloads.

To learn more about the research, we reached out to UC Riverside professor Mihri Ozkan, who wrote the paper along with an interdisciplinary team of UCR graduate students, including Lily Pompa, Md Shaihan Bin Iqbal, Yiu Chan, Daniel Morales, Zixun Chen, Handing Wang, Lusha Gao and Sandra Hernandez Gonzalez.

In the interview below, Ozkan explains why traditional metrics like Floating Point Operations Per Second (FLOPS) fall short of capturing real-world performance trade-offs when comparing distributed systems like NVIDIA’s H100 Tensor Core GPU, which relies on CUDA (NVIDIA’s proprietary parallel computing platform that was first launched in 2006), versus Cerebras’ monolithic wafer-scale approach. She also discusses emerging frameworks for matching workloads to hardware based on real-time performance, energy and reliability considerations.

Beyond FLOPS and bandwidth, what is the single most critical workload characteristic an R&D leader should look for that makes a monolithic architecture like the WSE-3 unequivocally the right choice, versus a workload where a distributed system like an H100 rack will have the advantage?

Mihri Ozkan, Professor, Frontier Fellow of the National Academy of Engineering. (UCR/Stan Lim)

Ozkan: In our recently published article, “Performance, Efficiency, and Cost Analysis of Wafer-Scale AI Accelerators vs. Single-Chip GPUs” (Device, 2025), we show that while raw compute specifications like FLOPS and bandwidth are often emphasized, they do not fully capture real-world performance trade-offs across architectures. One of the most decisive workload characteristics in choosing between wafer-scale monolithic systems (like Cerebras WSE-3) and distributed GPU racks (like H100 clusters) is intra-model communication dependency—specifically, the degree of inter-layer and inter-token data exchange required within short time windows.
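
To make the communication-dependency point concrete, here is a minimal latency model of our own. It is illustrative only: the compute time and link latencies are assumed order-of-magnitude placeholders, not measurements from the paper. When each step requires many synchronizations, microsecond-scale cluster hops dominate, while nanosecond-scale on-wafer hops stay in the noise:

```python
# Illustrative latency model (not from the UCR paper): why frequent
# fine-grained exchanges favor an on-wafer fabric over a GPU cluster.
# All numbers are hypothetical placeholders.

def step_time(compute_s: float, exchanges: int, link_latency_s: float) -> float:
    """Time for one model step: useful compute plus synchronization overhead."""
    return compute_s + exchanges * link_latency_s

COMPUTE = 1e-3     # 1 ms of useful compute per step (assumed)
ON_WAFER = 50e-9   # ~tens of ns per hop across a wafer fabric (assumed)
CLUSTER = 5e-6     # ~microseconds per hop across switched GPU interconnect (assumed)

for exchanges in (10, 1_000, 100_000):
    mono = step_time(COMPUTE, exchanges, ON_WAFER)
    dist = step_time(COMPUTE, exchanges, CLUSTER)
    print(f"{exchanges:>7} exchanges/step: wafer {mono*1e3:.2f} ms, "
          f"cluster {dist*1e3:.2f} ms")
```

At 10 exchanges per step the two are indistinguishable; at 100,000, the cluster spends roughly 500 ms per 1 ms of compute, which is the fine-grained regime Ozkan describes.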

In short, the monolithic wafer-scale approach excels when model architecture imposes frequent fine-grained communication, whereas distributed GPU clusters thrive with coarse-grained, easily partitionable workloads. As our upcoming submitted article explores in greater detail, the optimal match also depends on dynamic workload factors such as temporal variability, thermal balance, and hardware aging, which are not captured by static specs alone. While I cannot yet share the full results, this follow-up paper introduces a new framework that dynamically allocates inference workloads across heterogeneous hardware—including WSE-2/WSE-3 and GPUs—based on real-time sustainability and reliability metrics.
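
Ozkan’s follow-up framework is not yet public, so its details are unknown. Purely as an illustration of the general idea, a scheduler of this kind might score each device on throughput, energy per task and a reliability estimate; every field, value and weight below is hypothetical:

```python
# Hypothetical sketch of heterogeneous workload allocation. This is NOT
# the framework from Ozkan's follow-up paper (which is unpublished);
# all fields, numbers and weights are invented for illustration.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    throughput: float   # tasks/s on the current workload
    power_w: float      # current power draw
    health: float       # 0..1 reliability/aging estimate

def score(d: Device, w_perf=1.0, w_energy=0.5, w_rel=2.0) -> float:
    """Higher is better: reward throughput, penalize energy per task,
    and fold in a reliability/aging estimate."""
    energy_per_task = d.power_w / d.throughput
    return w_perf * d.throughput - w_energy * energy_per_task + w_rel * d.health

devices = [
    Device("WSE-3", throughput=50.0, power_w=20_000, health=0.95),
    Device("H100-rack", throughput=40.0, power_w=30_000, health=0.90),
]
best = max(devices, key=score)
print(f"Route next batch to: {best.name}")
```

A real system would, of course, refresh throughput, power and health from live telemetry rather than static values.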

Your review details solutions for thermal management and defect tolerance: integrated cooling loops and redundant core routing. In the real world, engineering is about managing unstated constraints. Between those two, which poses the greater practical barrier to wafer-scale systems today?

Ozkan: This is a critical question—and one that we explored in our recent article published in Device (Cell Press), where we examine both the hardware-level innovations and the system-level trade-offs involved in wafer-scale architectures like the Cerebras WSE-2 and WSE-3.

From an engineering standpoint, both the liquid cooling infrastructure and redundant core routing are elegant solutions to otherwise prohibitive scaling challenges. However, looking ahead, I would argue that defect tolerance and routing-induced performance variability are likely the more formidable long-term barrier, especially as wafer-scale systems evolve beyond inference acceleration into broader, mission-critical deployment scenarios.

Thermal management via integrated liquid cooling, while mechanically complex, is a mature field with well-established reliability engineering practices. Reliability concerns, quantified through metrics such as mean time between failures (MTBF), can often be mitigated through modular system design, predictive maintenance and redundancy in the cooling path.
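
The redundancy argument is standard reliability engineering rather than anything specific to the paper. Under an exponential failure model, duplicating a cooling loop sharply raises the probability that at least one path survives a mission interval; the MTBF value below is an assumed placeholder:

```python
# Standard reliability calculation (generic engineering practice, not
# from the paper): redundant parallel cooling paths.

import math

MTBF_HOURS = 50_000   # assumed MTBF of a single cooling loop
T = 8_760             # one year of continuous operation, in hours

r_single = math.exp(-T / MTBF_HOURS)       # survival prob. of one loop
r_redundant = 1 - (1 - r_single) ** 2      # at least one of two survives

print(f"Single loop survival over 1 yr: {r_single:.4f}")     # ~0.839
print(f"Two redundant loops over 1 yr:  {r_redundant:.4f}")  # ~0.974
```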

In our upcoming manuscript currently under review, we expand on this by modeling the degradation of wafer-scale systems over time. While I can’t disclose the full framework yet, it underscores that managing partial-good operation is not a one-time calibration problem, but rather an evolving systems challenge that compounds with usage, aging and temperature variance.

So, while cooling systems will always require maintenance and robustness, the dynamic complexity introduced by routing around increasing and uneven defects—especially under real-world AI workloads—may be the deeper long-term headache for wafer-scale computing.
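
As a generic sketch of what “partial-good” operation means in practice (this is not Cerebras’ actual routing scheme), a logical-to-physical core map can simply skip cores flagged as defective. Ozkan’s point is that this map cannot be computed once: it must be rebuilt as defects accumulate over the system’s life:

```python
# Minimal sketch of partial-good core mapping (a generic illustration,
# not Cerebras' routing scheme): route logical cores around defects.

def build_core_map(n_physical: int, defective: set[int]) -> list[int]:
    """Return a logical->physical core mapping that skips defective cores."""
    return [p for p in range(n_physical) if p not in defective]

# In practice the defect set grows with usage, aging and temperature
# variance, so the map must be recomputed periodically.
core_map = build_core_map(n_physical=16, defective={3, 7, 12})
print(f"Usable cores: {len(core_map)} of 16")
print(f"Logical core 5 runs on physical core {core_map[5]}")
```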

For a potential adopter, like a mid-sized pharma or industrial R&D lab, does cost or software optimization represent the more significant long-term barrier to adoption?

Ozkan: This is a key question for potential adopters evaluating whether to invest in wafer-scale infrastructure—and it goes to the heart of what we emphasized in our recent Device paper: that total cost of ownership (TCO) in AI hardware extends well beyond the sticker price.

For mid-sized R&D labs—such as pharmaceutical, biotech or industrial AI teams without the scale or in-house systems teams of a hyperscaler—the greater long-term barrier is often the software ecosystem transition, rather than just the upfront capital investment. While the hardware cost of a wafer-scale system (estimated at $2–3 million) is undeniably high, that figure is at least quantifiable. What is harder to assess—but often more strategic—is the opportunity cost and operational friction of migrating away from the mature, talent-rich and extensively supported NVIDIA CUDA ecosystem.

Adopting a wafer-scale system means embracing a specialized compiler stack and custom runtime environment. This typically requires dedicated staff retraining, code refactoring and dependency mapping—all of which carry time-to-deployment delays and strategic risk, especially in fast-paced domains like drug discovery or industrial AI.

In our submitted follow-up article, we address this issue by proposing a new framework. In short, for mid-sized adopters, the larger long-term barrier is the hidden cost of ecosystem disruption, rather than capital expenditure alone. As wafer-scale systems evolve and their software environments become more interoperable, we expect that barrier to decrease. But for now, it remains a major strategic consideration.
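
A toy total-cost-of-ownership model makes the trade-off concrete. Only the $2–3 million hardware estimate comes from the interview; every other figure below is an assumption for illustration:

```python
# Toy TCO comparison. The $2-3M wafer-scale hardware estimate is from
# the interview; all other figures are invented placeholders.

def tco(hardware, annual_power, annual_staff, migration_one_time, years=4):
    """Simple additive total-cost-of-ownership model over a fixed horizon."""
    return hardware + migration_one_time + years * (annual_power + annual_staff)

wafer_scale = tco(hardware=2_500_000, annual_power=250_000,
                  annual_staff=300_000, migration_one_time=500_000)
gpu_cluster = tco(hardware=1_500_000, annual_power=350_000,
                  annual_staff=150_000, migration_one_time=0)

print(f"Wafer-scale 4-yr TCO: ${wafer_scale:,.0f}")  # $5,200,000
print(f"GPU cluster 4-yr TCO: ${gpu_cluster:,.0f}")  # $3,500,000
```

Even with made-up inputs, the structure shows Ozkan’s point: recurring staff costs and one-time migration friction can rival or exceed the hardware price gap.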
