Research & Development World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • 2025 R&D 100 Award Winners
    • 2025 Professional Award Winners
    • 2025 Special Recognition Winners
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • Subscribe
    • Video
    • Webinars
    • Content submission guidelines for R&D World
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE

OpenAI’s GPT-5.6 Sol sets a coding record. Its own system card says it cheats sometimes.

By Brian Buntz | June 26, 2026

As Anthropic’s Fable 5 remains pulled from public access under a U.S. government export-control directive, OpenAI soft-launched GPT-5.6 Sol on June 26 to its own “trusted partners.”

OpenAI boasted that 5.6 is its strongest model yet and led with a coding result, a new state of the art on Terminal-Bench 2.1, the benchmark that scores agents on real command-line work. As a single model, Sol posted 88.8, edging GPT-5.5 at 88.0 and clearing the publicly launched Claude models and Gemini 3.1 Pro. Switched into the company’s new “ultra mode,” which farms work out to subagents, it reached 91.9.

The model appears to also have a tendency to cheat in some cases. OpenAI’s system card for GPT-5.6 also acknowledges “instances of the model cheating on tasks and fabricating research results.”

The independent evaluator OpenAI brought in before launch hit the same behavior, hard enough that it could not produce a number. METR, given pre-deployment access to Sol including its raw chain-of-thought, started a capability run on its Time Horizon software suite and walked away from the result.

The model’s detected cheating rate, METR wrote, was higher than any public model it had evaluated. The classification problem swamped the result. Treating the cheating attempts as failures, METR’s standard rule, put Sol’s 50% time horizon near 11.3 hours. Counting those same attempts as legitimate successes sent it past 270 hours, well outside the range where the suite gives reliable readings. Discarding them stripped out the data for several long-horizon tasks and produced a 71-hour estimate with a confidence interval stretching from 13 hours to 11,400. As METR put it, “we do not consider any of these numbers to represent a robust measurement of GPT-5.6 Sol’s capabilities.”

METR concluded the model is not significantly beyond the state of the art on software and R&D work, does not enable fully automated AI R&D, and does not reach the Critical threshold for AI self-improvement under OpenAI’s Preparedness Framework v2.

Tell Us What You Think! Cancel reply

You must be logged in to post a comment.

Related Articles Read More >

Noetik’s TARIO-2: A ‘world model’ that reads a tumor from a single slide
Six months in, Lilly says its supercomputer is starting to change the work with ‘near-infinite’ AI tokens
Boltz built its drug-discovery API ‘for agents as much as for people’
NVIDIA Announces BioNeMo Agent Toolkit with traction from nearly 50 partners, including Lilly, Thermo Fisher and Dassault
rd newsletter
EXPAND YOUR KNOWLEDGE AND STAY CONNECTED
Get the latest info on technologies, trends, and strategies in Research & Development.

R&D World Digital Issues

Fall 2025 issue

Browse the most current issue of R&D World and back issues in an easy to use high quality format. Clip, share and download with the leading R&D magazine today.

R&D 100 Awards
Research & Development World
  • Subscribe to R&D World Magazine
  • Sign up for R&D World’s newsletter
  • Contact Us
  • About Us
  • Drug Discovery & Development
  • Pharmaceutical Processing
  • Global Funding Forecast

Copyright © 2026 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media
Privacy Policy | Advertising | About Us

Search R&D World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • 2025 R&D 100 Award Winners
    • 2025 Professional Award Winners
    • 2025 Special Recognition Winners
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • Subscribe
    • Video
    • Webinars
    • Content submission guidelines for R&D World
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE