Agentic coding benchmark

Terminal-Bench 2.1 model scores

Terminal-Bench 2.1 tests AI models on real-world coding tasks in terminal and command-line environments. Bars show OpenAI's reported scores for GPT‑5.6 and comparison models. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.

Scores on a 0 to 100 scale. The first figure for each model is the bar (OpenAI-reported score). Where a second figure is given, it is the diamond (Anthropic-harness score).

GPT‑5.6 Sol Ultra (OpenAI, multi-agent, ultra mode with subagents): 91.9

GPT‑5.6 Sol (OpenAI): 88.8

GPT‑5.5 (OpenAI, prior gen): 88.0

GPT‑5.6 Luna (OpenAI): 84.3

Claude Mythos 5 (Anthropic): 84.3; Anthropic-harness score 88.0

Claude Fable 5 (Anthropic, Mythos 5 weights): 83.4; Anthropic-harness score 84.3

GPT‑5.6 Terra (OpenAI): 82.5

Claude Opus 4.8 (Anthropic): 78.9; Anthropic-harness score 82.7

Gemini 3.1 Pro Preview (Google): 70.7

Benchmark scope. Terminal-Bench 2.1 evaluates real-world coding tasks in terminal and command-line environments.

Bars and markers. Bars show OpenAI's reported Terminal-Bench 2.1 scores. Diamonds show Anthropic's mini‑SWE‑agent harness scores for Claude models.

GPT‑5.5 on Anthropic's harness. The 88.0 bar is OpenAI's figure. On the same mini‑SWE‑agent harness used for the diamonds, Anthropic's card puts GPT‑5.5 at about 81 (Harbor's leaderboard reproduction) to 83 (Anthropic's own run). Read on that one harness, GPT‑5.5's lead over the Claude models narrows or reverses: it trails Claude Mythos 5 (88.0) and Claude Fable 5 (84.3) and sits within about two points of Claude Opus 4.8 (82.7).
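The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration using the figures quoted in this chart and its notes; the variable and function names are made up for the example, and the 81 and 83 endpoints are the approximate values cited from Anthropic's card, not exact scores.

```python
# Scores copied from the chart; names here are illustrative, not from any source.
OPENAI_REPORTED = {  # bar values for the Claude models
    "Claude Mythos 5": 84.3,
    "Claude Fable 5": 83.4,
    "Claude Opus 4.8": 78.9,
}
ANTHROPIC_HARNESS = {  # diamond values for the same models
    "Claude Mythos 5": 88.0,
    "Claude Fable 5": 84.3,
    "Claude Opus 4.8": 82.7,
}

GPT55_BAR = 88.0           # OpenAI's figure for GPT-5.5
GPT55_HARNESS_LOW = 81.0   # approximate: Harbor's leaderboard reproduction
GPT55_HARNESS_HIGH = 83.0  # approximate: Anthropic's own run


def gpt55_lead(gpt55_score, claude_scores):
    """Return GPT-5.5's margin over each Claude model (negative means it trails)."""
    return {
        model: round(gpt55_score - score, 1)
        for model, score in claude_scores.items()
    }


# Margins when every number comes from OpenAI's reporting (the bars).
bars_lead = gpt55_lead(GPT55_BAR, OPENAI_REPORTED)

# Margins when every number comes from the mini-SWE-agent harness (the diamonds),
# at both ends of the approximate GPT-5.5 range.
harness_low = gpt55_lead(GPT55_HARNESS_LOW, ANTHROPIC_HARNESS)
harness_high = gpt55_lead(GPT55_HARNESS_HIGH, ANTHROPIC_HARNESS)

for label, margins in [
    ("bars", bars_lead),
    ("harness, GPT-5.5 at ~81", harness_low),
    ("harness, GPT-5.5 at ~83", harness_high),
]:
    print(label, margins)
```

On the bars, GPT‑5.5 leads the three Claude models by 3.7, 4.6, and 9.1 points. On the harness figures it trails Mythos 5 and Fable 5 at both ends of the range, and against Opus 4.8 the margin runs from 1.7 points behind to 0.3 points ahead.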

Ultra mode. GPT‑5.6 Sol Ultra uses OpenAI's subagent-based ultra mode, so it represents a multi-agent run.

Fable 5. Fable 5 uses the same underlying weights as Mythos 5. Anthropic reports 84.3 on its harness. Safety refusals occurred on 20.9% of trials, and those trials fell back to Claude Opus 4.8.

Sources: OpenAI GPT‑5.6 preview (bar scores); Anthropic Claude Fable 5 and Claude Mythos 5 System Card (diamond marker scores). This chart covers Terminal-Bench 2.1 only. GeneBench and ExploitBench/ExploitGym are separate evaluations. Scores as of June 26, 2026.