
[Adobe Stock]
The released Opus 4.7 and GPT-5.5 are neck and neck. Opus 4.7 leads on SWE-bench Pro (64.3 vs 58.6), HLE no tools (46.9 vs 41.4) and HLE with tools (54.7 vs 52.2).
- Mythos leads: SWE-bench Pro (77.8 vs 58.6 / 64.3), HLE no tools (56.8 vs 41.4 / 46.9), HLE with tools (64.7 vs 52.2 / 54.7), CyberGym (83 vs 81.8 / 73.1), OSWorld-Verified (79.6 vs 78.7 / 78.0), GraphWalks long context (soft comparison but 80 vs 45.4).
- Effectively tied: GPQA Diamond (94.5 vs 93.6 / 94.2) and Terminal-Bench 2.0 on the headline numbers (82 vs 82.7). Anthropic’s 92.1% Terminal-Bench 2.1 result shows how sensitive the benchmark is to harness and timeout choices, but it is not a matched GPT-5.5 comparison.
- GPT-5.5 Pro leads: BrowseComp (90.1 vs 86.9). That’s the only benchmark in the overlap where any GPT-5.5 variant clearly beats Mythos by a meaningful margin.
- Opus 4.7 leads GPT-5.5 base on: SWE-bench Pro, HLE with tools. Ties on GPQA and OSWorld. Loses on Terminal-Bench, BrowseComp, CyberGym.
OpenAI released GPT-5.5 to ChatGPT and Codex on Wednesday, calling it “a new class of intelligence for real work.” Two weeks earlier, Anthropic published a 245-page system card for Claude Mythos Preview, a model it said was sufficiently capable at vulnerability discovery and exploitation that the company decided against general release. (OpenAI first withheld the full release of its GPT-2 language model in February 2019.)
Meanwhile, the New York Times recently reported that Anthropic’s decision has set off a global scramble, with the Bank of England governor warning that Anthropic may have found a way to “crack the whole cyber-risk world open.” The European Commission met with Anthropic at least three times without securing access, according to the Times.
Neither system card places the two models head to head. Anthropic compares Mythos primarily against Claude Opus 4.6, GPT-5.4 and Gemini 3.1 Pro. OpenAI compares GPT-5.5 against GPT-5.4 and Claude Opus 4.7, OpenAI’s chosen public Claude comparator.
The overlapping benchmarks stack up like this:
| Benchmark | Mythos (gated) | GPT-5.5 | GPT-5.5 Pro | Opus 4.7† | Notes |
|---|---|---|---|---|---|
| SWE-bench Pro | 77.8% | 58.6% | — | 64.3% | Memorization concern¹ |
| Terminal-Bench 2.0 | 82% / 92.1%² | 82.7% | — | 69.4% | Different harnesses² |
| GPQA Diamond | 94.5% | 93.6% | — | 94.2% | At saturation |
| HLE, no tools | 56.8% | 41.4% | 43.1% | 46.9% | Largest clean Mythos lead |
| HLE, with tools | 64.7% | 52.2% | 57.2% | 54.7% | Different tool stacks |
| BrowseComp | 86.9% | 84.4% | 90.1% | 79.3% | Contamination flagged³ |
| CyberGym | 83% | 81.8% | — | 73.1% | — |
| OSWorld-Verified | 79.6% | 78.7% | — | 78.0% | — |
| GraphWalks (long context) | 80.0% | 45.4%⁴ | — | 41.2%⁴ | Different bucketing⁴ |
Caveats: † Opus 4.7 figures are drawn from OpenAI’s GPT-5.5 system card. Opus 4.7 was released between Mythos and GPT-5.5, so the Mythos card does not include Opus 4.7 numbers. ¹ SWE-bench Pro carries a known memorization concern. Anthropic published a filter re-score analysis showing Mythos’s margin holds on decontaminated subsets. OpenAI’s table flags “evidence of memorization” with an asterisk but publishes no filter analysis. ²
Other considerations: OpenAI ran Terminal-Bench 2.0 with a specialized Codex CLI harness. Anthropic used the Terminus-2 scaffold. Under Terminal-Bench 2.1 fixes with 4-hour timeouts, Anthropic measured Mythos at 92.1% and GPT-5.4 at 75.3% on the same setup. OpenAI did not disclose a matched re-run. ³ Anthropic reports a 15.1% short-transcript upper bound for memorization on BrowseComp. ⁴ Anthropic aggregates BFS 256K to 1M. OpenAI splits 256K (GPT-5.5: 73.7%) and 1M (GPT-5.5: 45.4%). OpenAI’s 1M Opus figure is actually Opus 4.6, not 4.7. The direct comparison is soft on both ends.



Tell Us What You Think!
You must be logged in to post a comment.