Agentic coding benchmark

Terminal-Bench 2.1 model scores

Terminal-Bench 2.1 tests AI models on real-world coding tasks in terminal and command-line environments. Bars show OpenAI's reported scores for GPT‑5.6 and comparison models. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.

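Model (vendor): OpenAI-reported score (bar), then Anthropic-harness score (diamond) where one is shown.

GPT‑5.6 Sol Ultra (OpenAI, multi-agent): 91.9
GPT‑5.6 Sol (OpenAI): 88.8
GPT‑5.5 (OpenAI, prior gen): 88.0
GPT‑5.6 Luna (OpenAI): 84.3
Claude Mythos 5 (Anthropic): 84.3; Anthropic harness 88.0
Claude Fable 5 (Anthropic, Mythos 5 weights): 83.4; Anthropic harness 84.3
GPT‑5.6 Terra (OpenAI): 82.5
Claude Opus 4.8 (Anthropic): 78.9; Anthropic harness 82.7
Gemini 3.1 Pro Preview (Google): 70.7

Scale: 0 to 100.
Legend: GPT‑5.6 (current); Ultra mode (subagents); GPT‑5.5 (prior gen); Claude (Anthropic); Gemini (Google); diamond marks an Anthropic-harness score.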

Benchmark scope. Terminal-Bench 2.1 evaluates real-world coding tasks in terminal and command-line environments.

Bars and markers. Bars show OpenAI's reported Terminal-Bench 2.1 scores. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.

GPT‑5.5 on Anthropic's harness. The 88.0 bar is OpenAI's figure. On the same mini‑SWE‑agent harness used for the diamonds, Anthropic's card scores GPT‑5.5 at about 81 (Harbor's leaderboard reproduction) to 83 (Anthropic's own run). On that harness GPT‑5.5 trails Claude Mythos 5 (88.0) and Claude Fable 5 (84.3) and sits within about two points of Claude Opus 4.8 (82.7), so its lead over the Claude models narrows or reverses once every model is read on one harness.
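The arithmetic behind this note can be checked with a short sketch. It uses only the scores shown in the chart and in the note. The names `gpt55_margins`, `CLAUDE`, `GPT55_BAR`, and `GPT55_HARNESS` are hypothetical, and treating "about 81" and "about 83" as exactly 81.0 and 83.0 is an assumption made for illustration.

```python
# Scores copied from the chart and the note above.
GPT55_BAR = 88.0               # OpenAI-reported GPT-5.5 score (bar)
GPT55_HARNESS = (81.0, 83.0)   # assumption: "about 81" and "about 83" taken as exact

# model: (OpenAI-reported bar, Anthropic-harness diamond)
CLAUDE = {
    "Claude Mythos 5": (84.3, 88.0),
    "Claude Fable 5": (83.4, 84.3),
    "Claude Opus 4.8": (78.9, 82.7),
}


def gpt55_margins(claude=CLAUDE):
    """Return GPT-5.5's margin over each Claude model.

    Each value is (bar_margin, (harness_low, harness_high)), where
    bar_margin compares the two bars and the harness pair compares
    GPT-5.5's two Anthropic-harness figures against the diamond.
    Positive means GPT-5.5 is ahead.
    """
    margins = {}
    for model, (bar, diamond) in claude.items():
        bar_margin = round(GPT55_BAR - bar, 1)
        harness_margin = tuple(round(g - diamond, 1) for g in GPT55_HARNESS)
        margins[model] = (bar_margin, harness_margin)
    return margins


if __name__ == "__main__":
    for model, (bar_margin, (low, high)) in gpt55_margins().items():
        print(f"{model}: bars {bar_margin:+.1f}, "
              f"Anthropic harness {low:+.1f} to {high:+.1f}")
```

A negative harness margin means the Claude model is ahead on Anthropic's harness. A range that straddles zero means the ordering depends on which of the two GPT‑5.5 figures is used.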

Ultra mode. GPT‑5.6 Sol Ultra uses OpenAI's subagent-based ultra mode, so it represents a multi-agent run.

Fable 5. Fable 5 uses the same underlying weights as Mythos 5. Anthropic reports 84.3 on its harness, with safety refusals on 20.9% of trials that then fell back to Claude Opus 4.8.

Sources: OpenAI GPT‑5.6 preview (bar scores); Anthropic Claude Fable 5 and Claude Mythos 5 System Card (diamond marker scores). The chart covers Terminal-Bench 2.1 only; GeneBench and ExploitBench/ExploitGym are separate evaluations. Scores as of June 26, 2026.