Just over two months after R&D World covered the debut of Google’s experimental Gemini 2.5 Pro, the company has announced a significant upgrade to the model, building on a prior May update that drew mixed reactions from developers.
Google’s latest Gemini 2.5 Pro has jumped to the top of AI performance rankings with an Elo score of 1470 on LMArena, the widely watched AI leaderboard that uses a chess-style rating system to measure how often models beat each other in head-to-head comparisons judged by user feedback. The 24-point Elo jump solidifies Gemini’s lead even as the model costs just $1.25 per million input tokens, compared with $10.00 for OpenAI’s o3. The gap is wider still against Claude Opus 4, whose input tokens cost twelve times as much.
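For readers unfamiliar with Elo-style arena scores, the sketch below is a rough illustration rather than LMArena’s actual methodology (the leaderboard fits a Bradley–Terry model to user votes): assuming the standard Elo logistic with a 400-point scale, it estimates the head-to-head win probability implied by a rating gap and works out the input-token price ratios quoted above. The Claude Opus 4 price is an assumed figure consistent with the one-twelfth comparison.

```python
# Rough illustration only: LMArena fits a Bradley-Terry model to user votes;
# the standard Elo logistic below is an approximation of the same idea.

def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 24-point rating gap (the size of Gemini 2.5 Pro's latest jump) implies
# roughly a 53% expected win rate over an opponent rated 24 points lower.
print(f"{elo_win_probability(1470, 1446):.1%}")  # ~53.4%

# Input-token pricing cited in the article (USD per million tokens).
gemini_25_pro = 1.25
openai_o3 = 10.00
claude_opus_4 = 15.00  # assumption, consistent with the "one-twelfth" comparison

print(f"o3 / Gemini: {openai_o3 / gemini_25_pro:.0f}x")         # 8x
print(f"Opus 4 / Gemini: {claude_opus_4 / gemini_25_pro:.0f}x")  # 12x
```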
While large language models are advancing fastest in web development, they are making strides in science, too. The latest Gemini 2.5 model now scores 86.4% on the GPQA Diamond benchmark, a difficult test of graduate-level scientific knowledge, where it outperforms OpenAI’s o3 (83.3%) and Claude Opus 4 (79.6%) in single-attempt evaluations. In code editing, it leads the pack on Aider Polyglot with a score of 82.2%.
[Embedded content: LMArena leaderboard rankings]
In announcing the upgrade, Google emphasized that this version addresses previous critiques while positioning the model for broader commercial use. “We also addressed feedback from our previous 2.5 Pro release, improving its style and structure — it can be more creative with better-formatted responses,” wrote Tulsee Doshi, senior director, product management at Google DeepMind. Doshi describes the updated model as “our most intelligent model yet” and notes it “will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications.”