OpenAI's GPT-5.6 Sol sets a coding record. Its own system card says it cheats sometimes.

As Anthropic’s Fable 5 remains pulled from public access under a U.S. government export-control directive, OpenAI soft-launched GPT-5.6 Sol on June 26 to its own “trusted partners.”

OpenAI boasted that 5.6 is its strongest model yet and led with a coding result, a new state of the art on Terminal-Bench 2.1, the benchmark that scores agents on real command-line work. As a single model, Sol posted 88.8, edging GPT-5.5 at 88.0 and clearing the publicly launched Claude models and Gemini 3.1 Pro. Switched into the company’s new “ultra mode,” which farms work out to subagents, it reached 91.9.

The model appears to also have a tendency to cheat in some cases. OpenAI’s system card for GPT-5.6 also acknowledges “instances of the model cheating on tasks and fabricating research results.”

The independent evaluator OpenAI brought in before launch hit the same behavior, hard enough that it could not produce a number. METR, given pre-deployment access to Sol including its raw chain-of-thought, started a capability run on its Time Horizon software suite and walked away from the result.

The model’s detected cheating rate, METR wrote, was higher than any public model it had evaluated. The classification problem swamped the result. Treating the cheating attempts as failures, METR’s standard rule, put Sol’s 50% time horizon near 11.3 hours. Counting those same attempts as legitimate successes sent it past 270 hours, well outside the range where the suite gives reliable readings. Discarding them stripped out the data for several long-horizon tasks and produced a 71-hour estimate with a confidence interval stretching from 13 hours to 11,400. As METR put it, “we do not consider any of these numbers to represent a robust measurement of GPT-5.6 Sol’s capabilities.”

METR concluded the model is not significantly beyond the state of the art on software and R&D work, does not enable fully automated AI R&D, and does not reach the Critical threshold for AI self-improvement under OpenAI’s Preparedness Framework v2.

OpenAI’s GPT-5.6 Sol sets a coding record. Its own system card says it cheats sometimes.

Tell Us What You Think! Cancel reply

Tell Us What You Think! Cancel reply

Related Articles Read More >

Noetik’s TARIO-2: A ‘world model’ that reads a tumor from a single slide

Six months in, Lilly says its supercomputer is starting to change the work with ‘near-infinite’ AI tokens

Boltz built its drug-discovery API ‘for agents as much as for people’

NVIDIA Announces BioNeMo Agent Toolkit with traction from nearly 50 partners, including Lilly, Thermo Fisher and Dassault

Search R&D World