Musk teases Grok 3.5, new model that reasons from ‘first principles’

Elon Musk says xAI will ship Grok 3.5 to its top-tier SuperGrok subscribers “next week,” promising an AI that can tackle questions about rocket-engine cycles or electrochemistry by deriving answers instead of recycling web text. Musk claimed in an X post that the upgrade will “come up with answers that simply don’t exist on the Internet.”

As of about May 11, X users note that the launch of 3.5 was delayed as the model was “still too rough around the edges,” adding that it could launch in “another week or so.”

First-principles reasoning targets one of the biggest holes in large language models: ground-truth accuracy on novel, highly technical questions. Most LLMs lean on pattern matching; ask something outside their training diet and hallucinations creep in. Grok 3 chipped away at that flaw, scoring 52.2% on the AIME’24 math contest and 75.4% on the graduate-level GPQA science set, versus GPT-4o’s 9.3% and 53.6% on the same exams, according to an xAI blog post.

The news also lands just over two months after Grok 3’s Feb. 19, 2025 debut.

Grok 3 trained on Colossus, xAI’s Memphis super-cluster that has already grown to 200,000 GPUs, twice its initial size, and is slated to scale to over one million GPUs in coming years. Powered by that silicon, Grok 3 vaulted to a 1,402 Elo rating on Chatbot Arena, topping GPT-4 and Claude 3.5 Sonnet in blind user polls. Whether Grok 3.5’s beta soars or face-plants, its arrival keeps pressure on OpenAI, Anthropic and Google to prove their own models can think like engineers, rather than parrots.

A rumor surfaced, reported by KrebsOnSecurity, suggesting Grok 3.5 may have been trained in part on proprietary materials from SpaceX and Tesla, following an alleged xAI employee’s accidental exposure of a private API key on GitHub. The key reportedly accessed unreleased Grok models, including some named to imply fine-tuning on SpaceX, Tesla, and X data. xAI has not publicly confirmed these reports. If true, however, it would help explain how Grok 3.5 may be the first AI model to answer technical questions about, say, rocket engines.

Road to 3.5

Grok version	Key jump	Notable result
Grok 1 (Nov 2023)	314 B-param MoE; weights later open-sourced	Established uncensored, X-fed model
1.5 / 1.5V (2024)	Reliability tweaks; adds vision input	Matches early GPT-4V demos
Grok 2 (Aug 2024)	128 k-token context	87% on MMLU, parity with GPT-4
Grok 3 (Feb 2025)	1 M-token context; RL “Think” mode	93.3% on AIME; 1,402 Elo on Chatbot Arena
Grok 3.5 (May 2025)	Refinement of 3; emphasis on first-principles reasoning	Beta to SuperGrok; benchmarks pending

Grok 3.5 would enter a crowded LLM landscape where benchmark supremacy shifts monthly. According to LiveBench, a leaderboard focused on contamination-free LLM evaluation, the Grok 3 Mini Beta (High) variant ranks 9th with a global average score of 70.25%, showing especially strong performance in reasoning (87.61%) but weaker results in coding (54.52%), placing it behind models from OpenAI, Google, Anthropic, DeepSeek and Alibaba. Historically, Grok models tend to fare well in mathematical and logical reasoning challenges. For Grok 3.5 to climb higher, it would need to maintain its reasoning edge while addressing gaps in coding and language performance, all while competing against OpenAI’s greater market saturation, Claude’s hybrid reasoning approach, and Gemini’s mathematics dominance (89.16%) and recent coding gains with Gemini 2.5 Pro.

Bottom line: Grok 3.5 could push large language models closer to genuine problem-solving. But proof will come from neutral tests, not tweets.
