Hold your exaflops! Why comparing AI clusters to supercomputers is bananas

By Brian Buntz | April 9, 2025

[Image from OpenAI’s image model]

Okay, deep breaths. Maybe you’ve heard the buzz around Google’s Ironwood TPUs, which generated at least one headline claiming the system offered a 24x performance boost over the world’s most advanced supercomputer, El Capitan. Or perhaps you caught the news about NVIDIA’s Blackwell line of GPUs, its forthcoming exaflop-class Vera Rubin platform, or xAI’s Colossus cluster, which already runs more than 200,000 NVIDIA GPUs in Memphis with plans to scale past one million.

Cue the tech world frenzy, visions of Skynet becoming self-aware on a single chip, and maybe a few premature obituaries for traditional High-Performance Computing (HPC).

The common thread? Mind-boggling performance numbers often hitting ExaFLOP scale, touted for both individual chips (at low precision) and massive clusters.

But hold your horses (and your exaflops). Comparing these current and future AI-focused systems directly to supercomputing giants like El Capitan based solely on that headline FLOPs number? That’s not just apples and oranges. It’s bananas.

The core problem: Not all FLOPs are created equal

The culprit? Floating-Point Precision.

  • FLOPs = Floating-Point Operations Per Second. It’s a measure of calculation speed. Sounds simple, right?
  • The catch: How you calculate matters. Computers store numbers using bits. More bits = more precision (accuracy), but usually = slower calculations and more memory/power.

It is a bit like math homework.

Supercomputers (like Frontier and El Capitan): These systems are designed for tasks like simulating nuclear physics, complex climate models, or precise fluid dynamics. Accuracy is key. Such systems primarily work in FP64 (64-bit Double Precision). This is like doing calculus with 10 decimal places. It’s hard, takes time, but you need the precise answer. The TOP500 list ranks these machines based on FP64 performance.

AI accelerators (TPUs and GPUs like Blackwell, Ironwood, and the forthcoming Rubin): These are optimized for the mathematics underpinning deep learning, primarily the vast numbers of matrix multiplications and related operations used in training and inference. While training models on massive datasets like the entire internet (for LLMs) or complex protein sequences (for scientific AI like AlphaFold) is hugely computationally intensive, the algorithms themselves are often resilient to slight numerical variations, so they don’t necessarily need the hyper-accuracy that some fields of scientific research demand. These chips therefore use lower-precision formats like FP16 (16-bit), BF16, TF32, or the new darlings FP8 and even FP4 (8 or 4 bits!). This is like doing rough estimates or rounding liberally: you get a good-enough answer way faster and using less energy.
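To make that “rounding liberally” idea concrete, here is a quick NumPy illustration (a minimal sketch; π is just an arbitrary test value) of how many digits each format actually keeps:

```python
import numpy as np

# Store the same value at decreasing precision and see how many digits survive.
for dtype in (np.float64, np.float32, np.float16):
    stored = dtype(np.pi)                     # NumPy scalar at this precision
    bits = np.dtype(dtype).itemsize * 8
    print(f"{bits:>2}-bit float: {stored:.15f}")

# Output:
#   64-bit float: 3.141592653589793
#   32-bit float: 3.141592741012573
#   16-bit float: 3.140625000000000
```

FP64 carries roughly 16 significant decimal digits, FP16 only three or four. For a neural network’s weights that is usually plenty; for a long-running physics simulation, those lost digits can matter.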

Comparing 42.5 exaflops at FP8 to El Capitan’s roughly 1.7 exaflops at FP64 is fundamentally misleading. Google notes in a blog post that Ironwood achieves its 4,614 teraflops (4.6 petaflops) of per-chip peak performance using FP8 (8-bit floating point), a format it now natively supports.
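For a sense of where that pod-level number comes from, divide the headline figure by the per-chip figure: 42.5 exaflops ÷ 4.6 petaflops per chip works out to roughly 9,200 chips per pod, with every one of those FLOPs counted at FP8.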

Comparing FLOPs can be complicated

While fewer bits tend to translate to less work and thus more speed, that isn’t always the case. The reality depends on whether the underlying hardware and software are optimized for that specific low-precision format.

Let’s illustrate with a quick Python example using NumPy, a standard library workhorse for numerical operations. We’ll compare matrix multiplication using standard float64 (our FP64 stand-in) versus float16, a lower-precision format that stores each number in 16 bits (twice the bit depth of FP8).
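The snippet below is a minimal sketch of that comparison rather than a definitive implementation; the matrix size, repeat count, and timing approach are illustrative choices:

```python
import time
import numpy as np

def time_matmul(dtype, n=1024, repeats=3):
    """Return the best wall-clock time for one n x n matrix multiplication."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(dtype)
    b = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return best

for dtype in (np.float64, np.float16):
    print(f"{np.dtype(dtype).name}: {time_matmul(dtype):.4f} s")
```

Don’t be surprised if float16 comes out slower than float64 on your CPU: the float64 path runs through a heavily optimized BLAS routine, while half precision generally has no comparable fast path in NumPy. That is exactly the caveat above about hardware and software support.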

Decoding the ExaFLOP hype

Recent headlines, especially around Google’s Ironwood, highlight significant performance gains, which occasionally invite comparisons to supercomputers.

Here’s the catch (it’s all about precision):

  • AI Accelerators (Ironwood, GPUs): Optimized for AI tasks (training and inference). They achieve big FLOPs numbers (like Ironwood’s claimed 42.5 ExaFLOPS per pod) by using low-precision formats (FP8, FP16). This is faster and more power-efficient for AI algorithms that tolerate less numerical accuracy.
  • Supercomputers (El Capitan): Built for high-fidelity scientific simulations (physics, climate). They require high-precision (FP64) calculations for accuracy. El Capitan’s roughly 1.7 ExaFLOPS of measured performance is based on this demanding FP64 standard (the HPL benchmark).

The faulty comparison: Directly comparing Ironwood’s peak low-precision (FP8) number to El Capitan’s benchmarked high-precision (FP64) number amounts to comparing different kinds of computational work optimized for entirely different problems.

It’s not apples-to-apples; it’s comparing a drag racer’s top speed (FP8) to a Mars rover’s scientific mission capability (FP64).

To explore this further, I conducted a deeper dive on an NVIDIA A100 GPU, increasing the matrix size to 4096×4096 and testing different types of operations beyond just matrix multiplication: complex element-wise calculations and large-scale summation (reduction).
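The exact code isn’t shown here, but a sketch of how such a test might be set up, assuming PyTorch on a CUDA device as one possible toolchain, looks like this:

```python
import torch

assert torch.cuda.is_available(), "This sketch expects a CUDA GPU (e.g., an A100)"
device = torch.device("cuda")
n = 4096

def bench_ms(fn, repeats=10):
    """Average milliseconds per call, using CUDA events for accurate GPU timing."""
    fn()                                   # warm-up call (kernel caching, etc.)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeats):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats

for dtype in (torch.float64, torch.float32, torch.float16):
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    timings = {
        "matmul": bench_ms(lambda: a @ b),
        "element-wise": bench_ms(lambda: torch.sin(a) * torch.exp(b)),
        "reduction": bench_ms(lambda: a.sum()),
    }
    print(dtype, {name: f"{ms:.2f} ms" for name, ms in timings.items()})
```

The absolute numbers will vary from run to run and GPU to GPU; what matters is the pattern across precisions and operation types.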

With the larger 4096×4096 matrices, FP16 absolutely flew, running about 15 times faster than FP64 in a Google Colab test environment. Yet standard single-precision float32 (FP32) was still slightly slower than FP64 for this matrix multiplication task on the A100. For element-wise calculations (applying a function to each number individually), FP64 remained the speed king on the A100, while for summing millions of numbers (a reduction operation), both FP32 and FP16 were significantly faster than FP64.

But speed came at a cost. While the precision loss for FP16 was noticeable but perhaps acceptable in the matrix and element-wise tests, it led to catastrophic failure in the summation test, yielding meaningless results (NaN error). This starkly illustrates the danger: the limited range and precision of FP16 can break numerical stability in algorithms sensitive to accumulating errors. The consistent advantage, however, was memory: FP16 used only a quarter of the memory of FP64, a crucial factor for fitting today’s enormous AI models onto hardware.
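That failure is easier to understand once you look at float16’s numeric range; here is a minimal NumPy illustration of the underlying mechanism (the benchmark’s actual data isn’t reproduced here):

```python
import numpy as np

x = np.full(100_000, 1.0, dtype=np.float16)

# The largest finite float16 value:
print(float(np.finfo(np.float16).max))    # 65504.0

# Accumulating in float16 overflows that range long before reaching the true
# total of 100,000; once infinities appear, downstream arithmetic such as
# inf - inf or inf / inf produces NaN.
print(x.sum(dtype=np.float16))            # inf
```

Keeping the accumulator in FP32 (or FP64), even when the data itself is stored in FP16, is the standard way to avoid this kind of blow-up.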

Perhaps a better premise than directly comparing AI clusters to supercomputers based on peak FLOPs is to ask a more fundamental question: What level of numerical accuracy do you actually need for the problem you are trying to solve?

  • AI’s Tolerance: Many deep learning algorithms have proven remarkably resilient to the “noise” introduced by lower precision. Approximate results are often sufficient for tasks like image recognition or language generation, allowing systems to leverage the massive speed and efficiency gains of FP16, FP8, and FP4 on specialized hardware (like Tensor Cores).
  • HPC’s Demand: Conversely, simulating the precise interactions of molecules, the evolution of galaxies, or the airflow over a wing often requires the high fidelity of FP64. Small numerical errors can accumulate and lead to drastically incorrect results in these sensitive calculations.

Therefore, the massive FLOP figures from AI hardware aren’t replacing traditional HPC FLOPs; they’re supplementing them by excelling at a different class of problems where lower precision is not only acceptable but advantageous. It’s about using the right tool – and the right numerical precision – for the job.

Comparing an AI cluster’s FP8 peak performance to a supercomputer’s FP64 benchmark isn’t just apples and oranges; it’s comparing the speed of a specialized drag racer on a prepared strip to the endurance and all-terrain capability of a Mars rover. Both are impressive feats of engineering, but they are optimized for fundamentally different tasks, driven by different requirements for numerical precision.

Next time you see a PetaFLOP or ExaFLOP number, a better question to ask than pondering how it compares to a supercomputer might be: “What kind of accuracy and what kind of compute operations are we talking about here?”
