Agentic coding benchmark
Terminal-Bench 2.1 tests AI models on real-world coding tasks in terminal and command-line environments. Bars show OpenAI's reported scores for GPT‑5.6 and comparison models. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.
Benchmark scope. Terminal-Bench 2.1 evaluates real-world coding tasks in terminal and command-line environments.
Bars and markers. Bars show OpenAI's reported Terminal-Bench 2.1 scores. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.
GPT‑5.5 on Anthropic's harness. The 88.0 bar is OpenAI's reported figure. On the same mini‑SWE‑agent harness used for the diamonds, Anthropic's system card puts GPT‑5.5 between about 81 (Harbor's leaderboard reproduction) and about 83 (Anthropic's own run), so its lead over the Claude models narrows or reverses once all models are compared on the same harness.
Ultra mode. GPT‑5.6 Sol Ultra uses OpenAI's subagent-based ultra mode, so it represents a multi-agent run.
Fable 5. Fable 5 uses the same underlying weights as Mythos 5. Anthropic reports 84.3 on its harness; safety refusals occurred on 20.9% of trials, and those trials fell back to Claude Opus 4.8.
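Harness comparison, worked through. The sketch below is a minimal illustration of the arithmetic behind the GPT‑5.5 note, not part of either vendor's methodology. It uses only the figures quoted in these notes, treats the "about 81" and "about 83" values as exact for illustration, and takes Fable 5's 84.3 as the Claude reference because it is the only Claude score stated here. The names `lead_over` and `gpt_5_5_scores` are hypothetical.

```python
# Terminal-Bench 2.1 scores quoted in the notes (percent).
# Assumption: "about 81" and "about 83" are treated as exact values.
FABLE_5_ANTHROPIC_HARNESS = 84.3

gpt_5_5_scores = {
    "OpenAI reported": 88.0,
    "Anthropic run (mini-SWE-agent)": 83.0,
    "Harbor reproduction (mini-SWE-agent)": 81.0,
}


def lead_over(baseline, scores):
    """Return each score's lead over the baseline, in percentage points."""
    return {label: round(score - baseline, 1) for label, score in scores.items()}


leads = lead_over(FABLE_5_ANTHROPIC_HARNESS, gpt_5_5_scores)
for label, lead in leads.items():
    # A positive value means GPT-5.5 is ahead of Fable 5; negative means behind.
    print(f"{label}: {lead:+.1f}")
```

Under these assumptions, the lead is positive only when OpenAI's reported figure is used and turns negative for both mini‑SWE‑agent figures, which is the "narrows or reverses" pattern described in the note.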
Sources: OpenAI GPT‑5.6 preview (bar scores); Anthropic Claude Fable 5 and Claude Mythos 5 System Card (diamond marker scores). Terminal-Bench 2.1 only; GeneBench and ExploitBench/ExploitGym are separate evaluations. Scores as of June 26, 2026.