How OpenAI's recently released GPT-5.5 stacks up with Anthropic's gated Claude Mythos

[Adobe Stock]

TL;DR: Claude Mythos Preview appears to lead cleanly on six of nine overlapping rows, especially SWE-bench Pro and Humanity’s Last Exam (HLE), but benchmark comparisons between Mythos and GPT-5.5 are imprecise.

The released Opus 4.7 and GPT-5.5 are neck and neck. Opus 4.7 leads on SWE-bench Pro (64.3 vs 58.6), HLE no tools (46.9 vs 41.4) and HLE with tools (54.7 vs 52.2).

Mythos leads: SWE-bench Pro (77.8 vs 58.6 / 64.3), HLE no tools (56.8 vs 41.4 / 46.9), HLE with tools (64.7 vs 52.2 / 54.7), CyberGym (83 vs 81.8 / 73.1), OSWorld-Verified (79.6 vs 78.7 / 78.0), GraphWalks long context (soft comparison but 80 vs 45.4).
Effectively tied: GPQA Diamond (94.5 vs 93.6 / 94.2) and Terminal-Bench 2.0 on the headline numbers (82 vs 82.7). Anthropic’s 92.1% Terminal-Bench 2.1 result shows how sensitive the benchmark is to harness and timeout choices, but it is not a matched GPT-5.5 comparison.
GPT-5.5 Pro leads: BrowseComp (90.1 vs 86.9). That’s the only benchmark in the overlap where any GPT-5.5 variant clearly beats Mythos by a meaningful margin.
Opus 4.7 leads GPT-5.5 base on: SWE-bench Pro, HLE with tools. Ties on GPQA and OSWorld. Loses on Terminal-Bench, BrowseComp, CyberGym.

OpenAI released GPT-5.5 to ChatGPT and Codex on Wednesday, calling it “a new class of intelligence for real work.” Two weeks earlier, Anthropic published a 245-page system card for Claude Mythos Preview, a model it said was sufficiently capable at vulnerability discovery and exploitation that the company decided against general release. (OpenAI first withheld the full release of its GPT-2 language model in February 2019.)

Meanwhile, the New York Times recently reported that Anthropic’s decision has set off a global scramble, with the Bank of England governor warning that Anthropic may have found a way to “crack the whole cyber-risk world open.” The European Commission met with Anthropic at least three times without securing access, according to the Times.

Neither system card places the two models head to head. Anthropic compares Mythos primarily against Claude Opus 4.6, GPT-5.4 and Gemini 3.1 Pro. OpenAI compares GPT-5.5 against GPT-5.4 and Claude Opus 4.7, OpenAI’s chosen public Claude comparator.

The overlapping benchmarks stack up like this:

Benchmark	Mythos (gated)	GPT-5.5	GPT-5.5 Pro	Opus 4.7†	Notes
SWE-bench Pro	77.8%	58.6%	—	64.3%	Memorization concern¹
Terminal-Bench 2.0	82% / 92.1%²	82.7%	—	69.4%	Different harnesses²
GPQA Diamond	94.5%	93.6%	—	94.2%	At saturation
HLE, no tools	56.8%	41.4%	43.1%	46.9%	Largest clean Mythos lead
HLE, with tools	64.7%	52.2%	57.2%	54.7%	Different tool stacks
BrowseComp	86.9%	84.4%	90.1%	79.3%	Contamination flagged³
CyberGym	83%	81.8%	—	73.1%	—
OSWorld-Verified	79.6%	78.7%	—	78.0%	—
GraphWalks (long context)	80.0%	45.4%⁴	—	41.2%⁴	Different bucketing⁴

Caveats: † Opus 4.7 figures are drawn from OpenAI’s GPT-5.5 system card. Opus 4.7 was released between Mythos and GPT-5.5, so the Mythos card does not include Opus 4.7 numbers. ¹ SWE-bench Pro carries a known memorization concern. Anthropic published a filter re-score analysis showing Mythos’s margin holds on decontaminated subsets. OpenAI’s table flags “evidence of memorization” with an asterisk but publishes no filter analysis. ²

Other considerations: OpenAI ran Terminal-Bench 2.0 with a specialized Codex CLI harness. Anthropic used the Terminus-2 scaffold. Under Terminal-Bench 2.1 fixes with 4-hour timeouts, Anthropic measured Mythos at 92.1% and GPT-5.4 at 75.3% on the same setup. OpenAI did not disclose a matched re-run. ³ Anthropic reports a 15.1% short-transcript upper bound for memorization on BrowseComp. ⁴ Anthropic aggregates BFS 256K to 1M. OpenAI splits 256K (GPT-5.5: 73.7%) and 1M (GPT-5.5: 45.4%). OpenAI’s 1M Opus figure is actually Opus 4.6, not 4.7. The direct comparison is soft on both ends.

How OpenAI’s recently released GPT-5.5 stacks up with Anthropic’s gated Claude Mythos

Tell Us What You Think! Cancel reply

Tell Us What You Think! Cancel reply

Related Articles Read More >

Roblonski platform automates photochemistry with 1,000-fold reduction in reagent use

SEQSTER CEO Ardy Arianpour on AWS Bio Discovery, fragmented health records and AI drug discovery’s missing patient layer

Five key trends that defined the show floor at Interphex 2026

Inside the Viz.ai–J&J deal to speed subdural hemorrhage care, from CT scan to treatment decision

Search R&D World