Research & Development World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • 2025 R&D 100 Award Winners
    • 2025 Professional Award Winners
    • 2025 Special Recognition Winners
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • Subscribe
    • Video
    • Webinars
    • Content submission guidelines for R&D World
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE

‘Best coding model’ or just better? Anthropic launches Claude Sonnet 4.5 to a skeptical crowd

By Brian Buntz | September 30, 2025

Anthropic art from the Sonnet 4.5 launch

Anthropic art from the Sonnet 4.5 launch

Anthropic boasts that its latest model, Claude Sonnet 4.5 is “the best coding model in the world.” It says: “It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math.”

The debut also lands amid a noisy few weeks of user gripes about regressions and limits of prior models, setting up 4.5 as both a technical upgrade and a reputational reset.

So far, feedback online for the model, as it was for OpenAI’s GPT-5, is mixed.

Early adopters of Sonnet 4.5 cite quirky but improved coding capabilities while others claim that the model is overly assumptive. Multiple subscribers to the company’s Pro and Max tiers, say they’re capping out faster than before, with some alleging silent reductions and questioning value at higher tiers.

In terms of claimed benchmarks, Sonnet 4.5 leads Anthropic’s own chart on SWE-bench Verified (n=500), posting 77.2% accuracy in standard runs and 82.0% with parallel test-time compute (the asterisked figure). That edges its prior state-of-the-art model Opus 4.1 (74.5% / 79.4%), Sonnet 4 (72.7% / 80.2%), and tops the single-run scores shown for OpenAI’s specialized coding model GPT-5 Codex (74.5%), GPT-5 (72.8%) and Google’s Gemini 2.5 Pro (67.2%).

To understand what these numbers mean: SWE-bench Verified is a real-world software engineering benchmark built from GitHub issues in popular Python repositories, where each task presents the model with a failing test or issue along with the full codebase, and the model must propose a patch that actually fixes the bug. A task only counts as correct if the patch applies cleanly and all the repository’s tests pass in a clean environment. There’s no partial credit. The “n=500” designation means this run used a 500-problem, human-curated subset with reliable setup and ground-truth fixes, and the accuracy percentages represent the solve rate: how many of those 500 issues the model successfully fixed end-to-end.

The asterisked scores come with a caveat. Anthropic’s methodology for the aforementioned 82% figure with parallel test-time compute involves sampling multiple parallel attempts, discarding any patches that break visible regression tests, then using an internal scoring model to select the best candidate from the remaining attempts, essentially giving the model multiple tries and picking the best one. The 77.2% single-run score is the more conservative baseline, though Anthropic notes it used a minor prompt addition instructing the model to “use tools as much as possible, ideally more than 100 times” and to “implement your own tests first before attempting the problem.” As of September 30, Claude Sonnet 4.5 has not yet appeared on key third-party leaderboards like LiveCodeBench (which updates monthly with fresh problems to reduce training contamination) or LMSYS Chatbot Arena’s crowd-sourced rankings.

A cross-vendor scorecard. Independent tester Artificial Analysis ranks Sonnet 4.5 (Thinking) #4 on its Artificial Analysis Intelligence Index, a composite of 10 public evals: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, TerminalBench-Hard and τ²-Bench Telecom. Sonnet 4.5 scores 61, up from 59 for Claude 4.1 Opus (Thinking) and 57 for Claude 4 Sonnet (Thinking). On the same scale, GPT-5 (high) leads at 68, with Grok 4 at 65 and Gemini 2.5 Pro and Grok 4 Fast at 60. In non-reasoning mode, Sonnet 4.5 rises to 49 (from 44). The index also notes token-efficiency gains for Sonnet 4.5 in reasoning mode.

Claude Sonnet 4.5 in context: Artificial Analysis Intelligence Index rankings

Model Score (Thinking Mode) Score (Standard)
GPT-5 (high) 68 –
Grok 4 65 –
Gemini 2.5 Pro 60 –
Claude Sonnet 4.5 61 49
Claude Opus 4.1 59 –
Claude Sonnet 4 57 44

Composite score across 10 public benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, TerminalBench-Hard, τ²-Bench Telecom

Related Articles Read More >

Study shows LLMs can diagnose ER patients more accurately than physicians
These microrobots can collect nanoplastics from water
New Pistoia Alliance survey shows just 1% of professionals report AI’s value in the wet lab
The startup JustPaid drew inspiration from the sitcom Silicon Valley gag into an AI engineer
rd newsletter
EXPAND YOUR KNOWLEDGE AND STAY CONNECTED
Get the latest info on technologies, trends, and strategies in Research & Development.

R&D World Digital Issues

Fall 2025 issue

Browse the most current issue of R&D World and back issues in an easy to use high quality format. Clip, share and download with the leading R&D magazine today.

R&D 100 Awards
Research & Development World
  • Subscribe to R&D World Magazine
  • Sign up for R&D World’s newsletter
  • Contact Us
  • About Us
  • Drug Discovery & Development
  • Pharmaceutical Processing
  • Global Funding Forecast

Copyright © 2026 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media
Privacy Policy | Advertising | About Us

Search R&D World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • 2025 R&D 100 Award Winners
    • 2025 Professional Award Winners
    • 2025 Special Recognition Winners
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • Subscribe
    • Video
    • Webinars
    • Content submission guidelines for R&D World
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE