Claude Mythos leads 17 of 18 benchmarks Anthropic measured. Muse Spark put Meta back in the frontier club, and OpenAI's 'Spud' model is reportedly near launch

Anthropic is not planning on publicly releasing it, but its Mythos model leads in 17 of 18 benchmarks, according to data in Anthropic’s model’s system card. The lone outlier is Measuring Massive Multitask Language Understanding (MMMLU), where Gemini 3.1 Pro’s 92.6–93.6 overlaps with Mythos’ score of 92.7.

One day later, on April 8, Meta Superintelligence Labs introduced Muse Spark, its first frontier model under chief AI officer Alexandr Wang. Where Anthropic published a capability report for a model it is withholding, Meta shipped a model that Artificial Analysis ranks fourth on its composite Intelligence Index at 52, behind a Gemini 3.1 Pro and GPT-5.4 (xhigh) tie at 57 and Claude Opus 4.6 at 53.

Anthropic claims its unreleased Claude Mythos Preview will ‘reshape cybersecurity’

Anthropic says Mythos is its “most capable frontier model to date, and shows a striking leap in scores on many evaluation benchmarks compared to our previous frontier model, Claude Opus 4.6.” The company goes onto say that Mythos offers “a step-change in vulnerability discovery and exploitation” that, operating “with minimal human steering,” autonomously finds zero-day vulnerabilities in open-source and closed-source software and develops them into working proof-of-concept exploits.

What the Mythos system card claims

Benchmark	Mythos Preview	Opus 4.6	GPT-5.4	Gemini 3.1 Pro
SWE-bench Verified	93.9	80.8	not released (n/r)	80.6
SWE-bench Pro	77.8	53.4	57.7	54.2
SWE-bench Multilingual	87.3	77.8	n/r	n/r
SWE-bench Multimodal	59.0	27.1	n/r	n/r
Terminal-Bench 2.0	82.0	65.4	75.1	68.5
Terminal-Bench 2.0 (extended timeout)	92.1	n/r	75.3	n/r
USAMO 2026	97.6	42.3	95.2	74.4
GPQA Diamond	94.5	91.3	92.8	94.3
Humanity’s Last Exam (with tools)	64.7	53.1	52.1	51.4
Humanity’s Last Exam (no tools)	56.8	40.0	39.8	44.4
OSWorld (computer use)	79.6	72.7	75.0	n/r
GraphWalks BFS 256K–1M	80.0	38.7	21.4	n/r
CharXiv Reasoning (with tools)	93.2	78.9	n/r	n/r
CharXiv Reasoning (no tools)	86.1	61.5	n/r	n/r
LAB-Bench FigQA (with tools)	89.0	75.1	n/r	n/r
BrowseComp	86.9	83.7	n/r	n/r
MMMLU	92.7	91.1	n/r	92.6–93.6
CyberGym	0.83	0.67	n/r	n/r
Cybench	100 (saturated)	n/r	n/r	n/r

Benchmark source: Anthropic, “Claude Mythos Preview” system card, red.anthropic.com, April 7, 2026.

Availability and pricing source: Anthropic, “Project Glasswing” announcement page, anthropic.com, April 7, 2026.

Anthropic did not compare Mythos Preview against traditional static analysis tools, as Heidy Khlaaf, Ph.D., chief AI scientist at the AI Now Institute noted on X. While Anthropic benchmarked Mythos against Claude Opus 4.6 and Claude Sonnet 4.6 on Cybench, CyberGym and a new Firefox 147 exploitation evaluation, it did not announce head-to-head data from CodeSonar, Coverity, Semgrep and the other similar tools. Khlaaf also noted on X that Anthropic did not report a false-positive rate for any cyber benchmark.

A Tweet from Ramez Naam, American technologist and science fiction writer, citing Epoch AI Research’s Epoch Capabilities Index (ECI) frames Mythos as a more incremental step-up from earlier model generartions.

While the cybersecurity ramifications of Mythos are clear, compute scarcity likely also shaped the decision to gate it. Frontier labs are triaging GPUs. On March 24, OpenAI killed Sora after the Wall Street Journal reported it was burning roughly $1 million per day against $2.1 million in lifetime revenue. OpenAI said it needed the GPUs for coding and enterprise work and for its unreleased ‘Spud’ model. On April 4, Anthropic cut Claude subscriptions off from third-party agentic harnesses such as OpenClaw. Head of Claude Code Boris Cherny said “capacity is a resource we manage thoughtfully” and that subscriptions were never built for autonomous-agent usage. Read together, Sora’s death, the OpenClaw cutoff and Mythos shipping only to Glasswing partners with $100 million in credits describe an industry routing scarce inference capacity toward its highest-value enterprise customers. Reliability data supports the capacity-strain read. Anthropic’s status page shows claude.ai uptime at 98.73% over the past 90 days, with five Opus 4.6 and Sonnet 4.6 error incidents in the first eight days of April alone. OpenAI logged 75 tracked incidents across its services in the same 90-day window. xAI’s Grok went fully unavailable for more than seven hours on January 27 and again for over two hours on March 10. Google’s Gemini, running on Google Cloud infrastructure, posted only two incidents in the same period. The labs without hyperscaler-grade infrastructure are the ones visibly rationing.

Anthropic claims its unreleased Claude Mythos Preview will ‘reshape cybersecurity’

What the Mythos system card claims

Related Articles Read More >

How Cypris evolved from selling patent reports to agentic R&D intelligence

Medable’s Digital Data Flow Agent focuses on protocol translation as the agentic race accelerates

AI image firm Midjourney spins up health division, unveils ‘Ultrasonic CT’

SpaceX is now worth nearly as much as 41 aerospace peers combined. Its revenue is another story

Search R&D World