Research & Development World


SandboxAQ’s SAIR dataset turns 5.2 M protein‑ligand structures into ground‑truth fuel for AI

By Brian Buntz | July 2, 2025

SandboxAQ bills SAIR (Structurally Augmented IC50 Repository) as “the largest public dataset of protein–ligand 3D structures paired with binding potency measurements.”

SandboxAQ, the Alphabet spinoff whose name reflects its work at the intersection of AI and quantum techniques, thinks testing drugs on animals is already passé. “It’s not so much that we’re going to somehow move a mouse model into computers,” said Nadia Harhen, general manager of AI simulation at the company. “It’s that we hope to be so predictive that you don’t need the mouse model in the first place. That’s the intent behind all of this.”

The “this” she is referring to includes the SAIR dataset: 5.2 million synthetic protein-ligand structures that the firm recently released for free. SAIR, which stands for Structurally Augmented IC50 Repository, gives researchers physics-grounded training data that, according to the company, enables predictions 1,000 times faster than traditional methods. (IC50 refers to the half-maximal inhibitory concentration, a standard measure of a drug’s potency.)
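For readers less familiar with potency units, here is a minimal Python sketch of how an IC50 value is commonly put on a logarithmic scale (pIC50) before it is used as a training label. The conversion is a general medicinal-chemistry convention, not something the article attributes to SAIR, and the function name and the choice of nanomolar input are assumptions made for illustration.

```python
import math


def ic50_nm_to_pic50(ic50_nanomolar: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50.

    pIC50 = -log10(IC50 in mol/L). Because 1 nM = 1e-9 mol/L,
    this equals 9 - log10(IC50 in nM). Higher pIC50 means a more
    potent compound (less of it is needed for 50% inhibition).
    """
    if ic50_nanomolar <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nanomolar)


if __name__ == "__main__":
    # Hypothetical example values, not taken from SAIR.
    for ic50 in (10_000.0, 1_000.0, 10.0):
        print(f"IC50 = {ic50:>8.1f} nM  ->  pIC50 = {ic50_nm_to_pic50(ic50):.2f}")
```

On this scale a tenfold drop in IC50 adds one pIC50 unit, which is why a logarithmic label is often easier for a model to learn than the raw concentration.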

From structure to function

SandboxAQ by the numbers

  • Founded: 2022 (Alphabet spinoff)
  • Patents: 18 granted (10 from Good Chemistry Inc. acquisition)
  • SAIR dataset: 2.5 TB, 5.2M structures, 1M+ unique pairs
  • Computing feat with SAIR: 20 days with 90%+ GPU utilization
  • Industry adoption: 6 pharma companies in first 48 hours
  • Team growth: 75+ in AI Simulation (from handful in 2022)
  • Recognition: TIME’s Most Influential Companies 2024, Edison Award Gold
  • Investors: Eric Schmidt, Marc Benioff, Yann LeCun, T. Rowe Price

Note: Patent data sourced from Google Patents Public Dataset, analyzing patents published from January 2018 onwards for SandboxAQ and its acquired companies Good Chemistry Inc. and Cryptosense.

The SAIR dataset, which captures more than 1 million protein-ligand systems and is available on Google Cloud Platform, fills a gap that even state-of-the-art structural biology AI models leave open. While platforms like Google DeepMind’s AlphaFold 3 can now predict the structure of complex protein-ligand interactions with high accuracy, they do not inherently predict the potency or binding affinity of that interaction. This is a missing puzzle piece for drug developers. Harhen explained the historical limitation of tools trained only on existing experimental data: when AlphaFold first emerged, it “covered everything in the Protein Data Bank at the time, but it didn’t really go beyond that. It didn’t cover the ‘undruggable’ world,” she said.

This created a problem: models trained on that data “would not generalize and extrapolate very well,” Harhen noted. SAIR was built to help address this shortfall by providing 5.2 million structures that are not only physics-grounded but are also explicitly labeled with their experimental IC50 potency data. This creates a large, ground-truth resource specifically designed to train next-generation AI models that can connect a molecule’s structure to its ultimate functional effectiveness.

Physics, not just language

The engineering achievement behind SAIR’s creation is as notable as its scope. Working with NVIDIA DGX Cloud, SandboxAQ achieved GPU utilization above 90%, a figure Harhen calls “unheard of” in the field. This optimization compressed what would typically require three to five years of computation into just 20 days. “Our teams [at SandboxAQ and NVIDIA] worked very, very closely together to achieve those types of statistics in order to create this dataset,” Harhen said. “It’s just incredible what you can do.”
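To put the quoted figures in perspective, the short Python calculation below turns "three to five years" and "20 days" into an implied compression factor. It assumes 365-day years and treats the baseline range as exact, so the result is back-of-the-envelope arithmetic rather than a measured speedup.

```python
DAYS_PER_YEAR = 365   # assumption: calendar years, leap days ignored
ACTUAL_DAYS = 20      # wall-clock time reported in the article


def compression_factor(baseline_years: float, actual_days: float = ACTUAL_DAYS) -> float:
    """Ratio of the baseline compute time to the reported wall-clock time."""
    return baseline_years * DAYS_PER_YEAR / actual_days


if __name__ == "__main__":
    low = compression_factor(3)   # three-year baseline
    high = compression_factor(5)  # five-year baseline
    print(f"{low:.2f}x to {high:.2f}x")  # prints "54.75x to 91.25x"
```

In other words, the 20-day run corresponds to roughly a 55-fold to 91-fold reduction relative to the stated baseline.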

Nadia Harhen

The dataset spans more than 1 million unique protein-ligand systems, each computationally generated with physics-based modeling rather than purely AI-driven predictions. “We didn’t train the algorithm,” Harhen emphasizes. “We released the dataset so that people can train their own algorithms based on the area of chemical space that they’re most interested in and fine-tune accordingly.”

This physics-first approach reflects SandboxAQ’s broader technical philosophy, which diverges from the language-model zeitgeist dominating AI headlines. The company’s Large Quantitative Models (LQMs) tackle a fundamentally different problem than their linguistic cousins. “LLMs get all the headlines, and rightfully so,” Harhen acknowledges. “But language is limited, specifically in quantitative applications when it comes to math, sciences and real B2B problems.”

While a language model can reason through some mathematical equations, she notes, “what it can’t do is actually solve multi-parameter optimizations over hundreds and thousands of parameters.” For drug discovery, where success means simultaneously optimizing potency, selectivity, toxicity, solubility and dozens of other properties, this distinction matters. “That is where our technology is inherently different,” Harhen said.
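The idea of optimizing several properties at once can be illustrated with a small Pareto-front filter in Python. This is a generic textbook technique, not SandboxAQ's method, and it works on a handful of precomputed scores rather than the hundreds or thousands of parameters Harhen describes. The candidate names, property scales and values are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    potency: float      # higher is better (e.g. a pIC50-like score)
    selectivity: float  # higher is better
    toxicity: float     # lower is better
    solubility: float   # higher is better


def objectives(c: Candidate) -> tuple:
    """Express every property as 'higher is better' by negating toxicity."""
    return (c.potency, c.selectivity, -c.toxicity, c.solubility)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    oa, ob = objectives(a), objectives(b)
    return all(x >= y for x, y in zip(oa, ob)) and any(x > y for x, y in zip(oa, ob))


def pareto_front(candidates: list) -> list:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    pool = [
        Candidate("mol_A", 8.0, 0.9, 0.2, 0.5),
        Candidate("mol_B", 7.0, 0.8, 0.3, 0.4),  # worse than mol_A on every axis
        Candidate("mol_C", 6.5, 0.7, 0.1, 0.9),
        Candidate("mol_D", 8.5, 0.5, 0.6, 0.3),
    ]
    print([c.name for c in pareto_front(pool)])  # prints ['mol_A', 'mol_C', 'mol_D']
```

The surviving candidates each represent a different trade-off: none can be improved on one property without giving something up on another, which is why drug discovery cannot be reduced to maximizing potency alone.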

Language is a component of what we use, but the underlying fundamentals are physics-based. —Harhen

Toward a universal chemistry engine

Early reception suggests the industry was waiting for this kind of resource. Within the first 48 hours of release, six pharma companies had already requested and received access to SAIR. “It’s been met with more interest than we anticipated,” Harhen said. She added that hundreds of researchers signed up for the introductory webinar.

The timing aligns with shifting regulatory attitudes: the FDA recently issued stronger guidance encouraging computational approaches to replace animal testing. “I do think that the FDA is very open to these types of methods now, more than ever before,” Harhen observed. For companies downloading SAIR, Harhen suggested, the dataset’s physics-based grounding means they can skip traditional validation steps. When asked whether users would need to synthesize molecules from SAIR to confirm predictions through wet lab testing, she was direct: “They could do that, but the idea is that they don’t have to. It’s already grounded in those types of data.”

Beyond pharma

The implications extend well beyond pharma labs. SandboxAQ’s platform treats molecular interactions as a universal computational challenge, whether the target is a protein binding site or a battery electrode. “It’s all interface and chemistry at the end of the day,” Harhen explained. “Catalysts are like enzymes; enzymes are in the body. You could see how many of the tools that we make on the bio side then translate to our chemical simulation work.” The company is already preparing its next dataset release focused on catalysts, with applications spanning from industrial chemistry to clean energy. This cross-pollination reflects SandboxAQ’s focus: “We’re not a therapeutics company, so we don’t pursue things based on a therapeutic area,” Harhen said. “We’re a technology company, so we invest based on the technology as aligned with our customer needs.”

Copyright © 2025 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media