Research & Development World


Grok-5: AGI or battleship Yamato of AI?

By Brian Buntz | December 8, 2025

xAI's Colossus supercluster in Memphis reached 200,000 NVIDIA GPUs in 214 days. It took 122 days to deploy the first 100,000, another 92 to double capacity. (Source: xAI)

In August, Elon Musk touted the potential of Grok 5, the forthcoming flagship model from his xAI startup. “I think it has a shot at being true AGI. Haven’t felt that about anything before,” he said. He has also claimed “higher intelligence density per gigabyte” than competitors.

While definitions of AGI — artificial general intelligence — vary, the term implies AI that can perform any intellectual task at least as well as a human.

Whatever their accuracy, the rumors paint an ambitious picture. With a rumored Q1 2026 release, leaked specifications suggest brute-force scaling: a 6-trillion-parameter Mixture-of-Experts architecture (roughly double Grok 4’s rumored 3T), trained on the “Colossus 2” supercluster with more than 200,000 NVIDIA GPUs drawing approximately 1 gigawatt of power, enough to run a small city. xAI claims Grok-5 will have a native 1.5-million-token context window, real-time multimodal processing, integration with live X data streams, and training on Tesla FSD video.

Grok-5’s training run, delayed from a planned end-of-2025 launch, could mark the apex of the “Naïve Scaling” era. The so-called scaling laws, which held that model performance improves predictably and log-linearly with compute, parameters, and training data, appear to be hitting diminishing returns on reasoning benchmarks. Meanwhile, the AI race has grown tighter: the December 2025 LMArena leaderboards show the top models separated by as little as 10 Elo points, which is statistical noise. If Grok-5 delivers a generational leap, it could break from this pack. If the “Saturating Returns” hypothesis holds, it risks becoming a costly confirmation of diminishing returns.

The parallel to history’s most famous white elephant is hard to ignore. Japan’s battleship Yamato, arguably the largest and most powerful ever built, was obsolete before it fired a shot: aircraft carriers had already ended the battleship era. Grok-5 risks the same fate: the apex of one paradigm arriving just as the next renders it irrelevant.

1. ‘Thinking’ is not a given

OpenAI’s o1 model introduced “Test-Time Compute” in 2024, spending more inference cycles to reason through problems. That was a breakthrough. Now it’s the norm.

xAI already has a competitive thinking architecture; Grok-4.1-thinking trails Gemini by just 10 Elo points on the text leaderboard. The question isn’t whether Grok-5 will have System 2 capabilities; it’s whether 6T parameters amplify or merely inflate them. Here’s how close the models are, based on a December 8 snapshot from LMArena:

Lab        Flagship          Thinking Variant     Arena Rank
Google     Gemini 3.0 Pro    Integrated           #1 (1491)
xAI        Grok-4.1          Grok-4.1-thinking    #2 (1481)
Anthropic  Claude Opus 4.5   Opus 4.5-thinking    #3 (1471)
OpenAI     GPT-5.1-high      Native               #6 (1457)

2. The hardware gamble

xAI bet on Ethernet (Nvidia Spectrum-X with BlueField-3 DPUs) over industry-standard InfiniBand, a contrarian choice at this scale.

MoE architectures require “All-to-All” communication where every GPU exchanges data with every other, creating massive “Incast” congestion when thousands of packets converge on single switches (Xue et al., ACM TACO 2020). Reliable expert routing demands sub-100µs latency. BlueField-3’s ARM cores face inherent limitations: in-order execution restricts instruction-level parallelism, and interrupt handling overhead accumulates at high data rates.
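The All-to-All traffic described above can be sized with a back-of-envelope calculation. The sketch below estimates how many bytes each GPU pushes onto the fabric during the dispatch phase of one MoE layer under uniform top-k routing; every figure is an illustrative assumption, not an xAI specification.

```python
# Back-of-envelope sketch of MoE "All-to-All" dispatch traffic.
# All numbers are illustrative assumptions, not xAI specifications.

def alltoall_bytes_per_gpu(tokens_per_gpu: int, hidden_dim: int,
                           top_k: int, n_gpus: int,
                           bytes_per_elem: int = 2) -> int:
    """Bytes each GPU sends in one dispatch phase of one MoE layer.

    With uniform routing, a fraction (n_gpus - 1) / n_gpus of each
    token's top_k expert copies live on remote GPUs.
    """
    copies = tokens_per_gpu * top_k              # expert copies per GPU
    remote = copies * (n_gpus - 1) / n_gpus      # copies leaving the node
    return int(remote * hidden_dim * bytes_per_elem)

# Illustrative: 8k tokens/GPU, 8k hidden dim, top-2 routing, 256-GPU group
b = alltoall_bytes_per_gpu(tokens_per_gpu=8192, hidden_dim=8192,
                           top_k=2, n_gpus=256)
print(f"{b / 1e9:.2f} GB per GPU per layer (dispatch only)")
```

The combine phase that returns expert outputs roughly doubles this, and real routing is skewed rather than uniform, which is precisely what produces the incast hot spots at individual switches.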

However, recent BlueField-3 benchmarks show 61% latency reduction and 82% bandwidth improvement when hardware offloading is properly configured (Michalowicz et al., IEEE Hot Interconnects 2023). The key enabler: HPCC (High Precision Congestion Control), which achieves near-zero in-network queues (Li et al., ACM SIGCOMM 2019). The StaR architecture demonstrates 4.13× throughput improvement by offloading connection state (Wang et al., ICNP 2021).

Assessment: Colossus likely sustains the training run, but only with aggressive hardware offloading. Success would validate Ethernet at hyperscale; failure would bottleneck the entire 6T run.

3. The data paradox: Real-time X vs. ‘brain rot’

xAI’s unique moat is real-time access to the X firehose. Research suggests this cuts both ways.

Models trained heavily on social media data risk what researchers call “Model Autophagy Disorder,” degraded output quality when AI systems train on AI-generated content (Alemohammad et al., ICLR 2024). High-engagement text optimizes for emotional resonance, not logical coherence. xAI claims “curiosity-driven” filtering isolates signal from noise, but it’s unlikely that any automated curation fully neutralizes the distributional shift. Worse: if Grok-5 generates content that feeds back into X’s corpus, this self-consuming feedback loop accelerates.
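The self-consuming loop can be demonstrated in miniature. The toy below replaces an LLM with a Gaussian, a deliberate simplification of the Alemohammad et al. setup: each generation is fit only to the previous generation's synthetic samples, and diversity (standard deviation) drifts toward collapse.

```python
# Toy model of "Model Autophagy Disorder": each generation trains only
# on the previous generation's outputs. The Gaussian stands in for a
# full generative model; real LLM dynamics are far richer, but the
# same qualitative diversity loss appears.
import random
import statistics

def autophagy_std(n: int, generations: int, seed: int) -> float:
    """Std of the corpus after `generations` rounds of self-training."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "human" corpus
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Next corpus is purely synthetic: sampled from the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.pstdev(data)

# With the loop fully closed, spread tends well below the original 1.0.
print(autophagy_std(n=25, generations=20, seed=0))
```

Mixing fresh human data back in each round slows or halts the collapse, which is why the Grok-5-writes-back-into-X scenario is the worrying variant.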

4. The “world model” wildcard: Tesla video

AI pioneer Yann LeCun argues LLMs fail because they lack a “world model”: an implicit grasp of physics and causality. Musk’s counter-wager is Tesla FSD video data.

[Image: The battleship Yamato. Photo by Hasuya Hirohata, from the records of the Yamato Museum (PG061427), courtesy of Kazutoshi Hando. Public domain, https://commons.wikimedia.org/w/index.php?curid=382832]

By training on video prediction, Grok-5 may learn physical dynamics: object permanence, motion, spatial reasoning. Research on multimodal fusion suggests video-LLM integration can meaningfully improve spatial reasoning capabilities (Han et al., Information Fusion 2025). FSD data likely improves embodied reasoning. But whether that transfers to abstract symbolic reasoning, the core of ARC-AGI, is less certain. Video teaches physics intuition, not logical inference.
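The next-frame-prediction objective behind “world model” claims is easy to illustrate. In this 1-D toy (entirely hypothetical, unrelated to Tesla’s or xAI’s actual pipelines), a predictor that has internalized constant-velocity motion beats one that merely copies the last frame:

```python
# Toy next-frame prediction: a "physics" predictor (constant velocity)
# vs. a naive copy-last-frame baseline. Purely illustrative.

def mse(a, b):
    """Mean squared error between two equal-length frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A "video": 1-D positions of points moving at constant velocity.
frames = [[float(t + i) for i in range(4)] for t in range(10)]

# Baseline predictor: tomorrow looks exactly like today.
copy_loss = mse(frames[-2], frames[-1])

# "Physics" predictor: extrapolate the velocity seen between the last
# two observed frames.
prev, curr = frames[-3], frames[-2]
predicted = [c + (c - p) for p, c in zip(prev, curr)]
physics_loss = mse(predicted, frames[-1])

print(copy_loss, physics_loss)  # prints 1.0 0.0: the physics predictor wins
```

Minimizing this kind of loss rewards internalizing motion, which is the sense in which video training builds physics intuition; nothing in the objective rewards symbolic inference.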

5. The economic bubble

The Colossus cluster represents $7B+ in hardware drawing about 300 MW. Running a 6T model costs 3–5× more per token than GPT-4-class models, even with MoE sparsity (Cao et al., MoE-Lightning, ASPLOS 2025; Kong et al., SwapMoE, ACL 2024). Enterprise adoption increasingly favors “distilled” models that are “good enough” at 1–10% the cost.
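The per-token cost gap follows from active-parameter counts. The arithmetic below uses the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; every model figure is a rumor or an assumption for illustration, not a confirmed specification.

```python
# Back-of-envelope per-token inference compute, using the ~2 FLOPs per
# active parameter per token rule of thumb. All model figures below are
# rumored or assumed for illustration, not confirmed specifications.

def flops_per_token(active_params: float) -> float:
    """Forward-pass FLOPs per generated token (rule of thumb)."""
    return 2.0 * active_params

# Rumored 6T-parameter MoE: assume ~1/8 of experts fire per token.
grok5_active = 6.0e12 / 8     # 750B active parameters (assumed)
gpt4_active = 2.2e11          # ~220B active, a widely rumored figure

ratio = flops_per_token(grok5_active) / flops_per_token(gpt4_active)
print(f"~{ratio:.1f}x compute per token")  # lands inside the cited 3-5x range
```

Memory traffic and batching efficiency move real costs around, but the active-parameter ratio sets the floor, which is why sparsity alone cannot rescue the unit economics of a 6T model.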

The broader AI investment climate amplifies these concerns. Goldman Sachs head of global equity research Jim Covello asked in 2024: “What trillion-dollar problem will AI solve?” He noted then that spending patterns represent “basically the polar opposite of prior technology transitions.” That same year, Sequoia Capital’s David Cahn framed it as “AI’s $600 billion question”: whether the technology can ever recoup massive data center investment. MIT economist Daron Acemoglu, the 2024 Nobel laureate, warned more recently that “these models are being hyped up, and we’re investing more than we should.” Hyperscalers (Amazon, Google, Meta, and Microsoft) are collectively spending approximately $400 billion on AI infrastructure this year, with some devoting 50% of current cash flow to data center construction. A Barclays research note titled “Cloud AI Capex: FOMO or Field-Of-Dreams?” warned the industry could be headed for an “overbuild” similar to the telecom crash that followed the dot-com bubble.

Likelihood: Grok-5 will face challenging unit economics in the general market. However, xAI’s captive integration with Tesla (for Optimus/FSD) and X (for search) provides a strategic buffer against pure market price sensitivity.

Verdict: The “Yamato” risk

Grok-5 is a high-stakes validation test for the “Densing Laws,” the theory that efficiency gains can prolong the life of pure scaling.

Grok-5 is more likely to confirm the scaling plateau than transcend it. The Tesla video data is the genuine wildcard: if video prediction translates to generalizable reasoning, xAI may have found the “world model” shortcut LeCun insists is missing. But the base case remains a competitive-but-not-dominant model that excels at multimodal tasks while hitting the same reasoning ceiling as everyone else. And the bear case, in which infrastructure friction and data quality issues degrade the training run, is a real risk, not a tail scenario.

