Research & Development World

How OpenAI’s recently released GPT-5.5 stacks up with Anthropic’s gated Claude Mythos

By Brian Buntz | April 23, 2026

TL;DR: Claude Mythos Preview appears to lead cleanly on six of nine overlapping rows, especially SWE-bench Pro and Humanity’s Last Exam (HLE), but benchmark comparisons between Mythos and GPT-5.5 are imprecise.

The released Opus 4.7 and GPT-5.5 are neck and neck. Opus 4.7 leads on SWE-bench Pro (64.3 vs 58.6), HLE no tools (46.9 vs 41.4) and HLE with tools (54.7 vs 52.2).

  • Mythos leads: SWE-bench Pro (77.8 vs 58.6 / 64.3), HLE no tools (56.8 vs 41.4 / 46.9), HLE with tools (64.7 vs 52.2 / 54.7), CyberGym (83 vs 81.8 / 73.1), OSWorld-Verified (79.6 vs 78.7 / 78.0), GraphWalks long context (soft comparison but 80 vs 45.4).
  • Effectively tied: GPQA Diamond (94.5 vs 93.6 / 94.2) and Terminal-Bench 2.0 on the headline numbers (82 vs 82.7). Anthropic’s 92.1% Terminal-Bench 2.1 result shows how sensitive the benchmark is to harness and timeout choices, but it is not a matched GPT-5.5 comparison.
  • GPT-5.5 Pro leads: BrowseComp (90.1 vs 86.9). That’s the only benchmark in the overlap where any GPT-5.5 variant beats Mythos by a meaningful margin.
  • Opus 4.7 vs GPT-5.5 base: Opus 4.7 leads on SWE-bench Pro, HLE no tools and HLE with tools; ties on GPQA Diamond and OSWorld-Verified; and trails on Terminal-Bench 2.0, BrowseComp and CyberGym.

OpenAI released GPT-5.5 to ChatGPT and Codex on Wednesday, calling it “a new class of intelligence for real work.” Two weeks earlier, Anthropic published a 245-page system card for Claude Mythos Preview, a model it said was sufficiently capable at vulnerability discovery and exploitation that the company decided against general release. (OpenAI first withheld the full release of its GPT-2 language model in February 2019.)

Meanwhile, the New York Times recently reported that Anthropic’s decision has set off a global scramble, with the Bank of England governor warning that Anthropic may have found a way to “crack the whole cyber-risk world open.” The European Commission met with Anthropic at least three times without securing access, according to the Times.

Neither system card places the two models head to head. Anthropic compares Mythos primarily against Claude Opus 4.6, GPT-5.4 and Gemini 3.1 Pro. OpenAI compares GPT-5.5 against GPT-5.4 and Claude Opus 4.7, the publicly available Claude model it chose as its comparator.

The overlapping benchmarks stack up like this:

| Benchmark | Mythos (gated) | GPT-5.5 | GPT-5.5 Pro | Opus 4.7† | Notes |
|---|---|---|---|---|---|
| SWE-bench Pro | 77.8% | 58.6% | — | 64.3% | Memorization concern¹ |
| Terminal-Bench 2.0 | 82% / 92.1%² | 82.7% | — | 69.4% | Different harnesses² |
| GPQA Diamond | 94.5% | 93.6% | — | 94.2% | At saturation |
| HLE, no tools | 56.8% | 41.4% | 43.1% | 46.9% | Largest clean Mythos lead |
| HLE, with tools | 64.7% | 52.2% | 57.2% | 54.7% | Different tool stacks |
| BrowseComp | 86.9% | 84.4% | 90.1% | 79.3% | Contamination flagged³ |
| CyberGym | 83% | 81.8% | — | 73.1% | — |
| OSWorld-Verified | 79.6% | 78.7% | — | 78.0% | — |
| GraphWalks (long context) | 80.0% | 45.4%⁴ | — | 41.2%⁴ | Different bucketing⁴ |

Caveats and other considerations:

† Opus 4.7 figures are drawn from OpenAI’s GPT-5.5 system card. Opus 4.7 was released between Mythos and GPT-5.5, so the Mythos card does not include Opus 4.7 numbers.

¹ SWE-bench Pro carries a known memorization concern. Anthropic published a filter re-score analysis showing Mythos’s margin holds on decontaminated subsets. OpenAI’s table flags “evidence of memorization” with an asterisk but publishes no filter analysis.

² OpenAI ran Terminal-Bench 2.0 with a specialized Codex CLI harness. Anthropic used the Terminus-2 scaffold. Under Terminal-Bench 2.1 fixes with 4-hour timeouts, Anthropic measured Mythos at 92.1% and GPT-5.4 at 75.3% on the same setup. OpenAI did not disclose a matched re-run.

³ Anthropic reports a 15.1% short-transcript upper bound for memorization on BrowseComp.

⁴ Anthropic aggregates BFS 256K to 1M. OpenAI splits 256K (GPT-5.5: 73.7%) and 1M (GPT-5.5: 45.4%). OpenAI’s 1M Opus figure is actually Opus 4.6, not 4.7. The direct comparison is soft on both ends.
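
For readers who want to recompute the gaps themselves, the following is a minimal Python sketch of the percentage-point arithmetic behind the comparison. The scores are transcribed from the table in this article, and only the headline 82% Mythos figure is used for Terminal-Bench 2.0. The names `SCORES`, `margin` and `ranked_margins` and the column keys are hypothetical, chosen for illustration. The raw differences ignore the caveats about harnesses, tool stacks, contamination and bucketing, so they should be read as rough indicators rather than matched results.

```python
# Scores in percent, transcribed from the article's table.
# None marks a cell the table leaves blank.
# Column keys (mythos, gpt55, gpt55_pro, opus47) are illustrative names.
SCORES = {
    "SWE-bench Pro":             {"mythos": 77.8, "gpt55": 58.6, "gpt55_pro": None, "opus47": 64.3},
    "Terminal-Bench 2.0":        {"mythos": 82.0, "gpt55": 82.7, "gpt55_pro": None, "opus47": 69.4},
    "GPQA Diamond":              {"mythos": 94.5, "gpt55": 93.6, "gpt55_pro": None, "opus47": 94.2},
    "HLE, no tools":             {"mythos": 56.8, "gpt55": 41.4, "gpt55_pro": 43.1, "opus47": 46.9},
    "HLE, with tools":           {"mythos": 64.7, "gpt55": 52.2, "gpt55_pro": 57.2, "opus47": 54.7},
    "BrowseComp":                {"mythos": 86.9, "gpt55": 84.4, "gpt55_pro": 90.1, "opus47": 79.3},
    "CyberGym":                  {"mythos": 83.0, "gpt55": 81.8, "gpt55_pro": None, "opus47": 73.1},
    "OSWorld-Verified":          {"mythos": 79.6, "gpt55": 78.7, "gpt55_pro": None, "opus47": 78.0},
    # Soft comparison: the vendors bucket context lengths differently, and the
    # Opus figure here is reported in the caveats as Opus 4.6, not 4.7.
    "GraphWalks (long context)": {"mythos": 80.0, "gpt55": 45.4, "gpt55_pro": None, "opus47": 41.2},
}


def margin(benchmark, a, b, scores=SCORES):
    """Return score of model `a` minus model `b` in percentage points.

    Returns None when either model has no score for the benchmark.
    """
    row = scores[benchmark]
    if row[a] is None or row[b] is None:
        return None
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(row[a] - row[b], 1)


def ranked_margins(a, b, scores=SCORES):
    """List (benchmark, margin) pairs, largest lead for `a` first."""
    pairs = [(name, margin(name, a, b, scores)) for name in scores]
    comparable = [p for p in pairs if p[1] is not None]
    return sorted(comparable, key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    for name, points in ranked_margins("mythos", "gpt55"):
        print(f"{name:<28}{points:+.1f}")
```

Calling `ranked_margins("opus47", "gpt55")` gives the same view for the two released models, and `margin("BrowseComp", "mythos", "gpt55_pro")` isolates the one row where a GPT-5.5 variant is ahead of Mythos.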

Copyright © 2026 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media