
SandboxAQ bills SAIR (Structurally Augmented IC50 Repository) as “the largest public dataset of protein–ligand 3D structures paired with binding potency measurements.”
SandboxAQ, the Alphabet spinoff whose name reflects its work at the intersection of AI and quantum techniques, thinks testing drugs on animals is already passé. “It’s not so much that we’re going to somehow move a mouse model into computers,” said Nadia Harhen, general manager of AI simulation at the company. “It’s that we hope to be so predictive that you don’t need the mouse model in the first place. That’s the intent behind all of this.”
The “this” she is referring to encompasses SAIR, a dataset of 5.2 million synthetic protein–ligand structures the firm recently released for free. SAIR, which stands for Structurally Augmented IC50 Repository, gives researchers physics-grounded training data that the company says enables predictions 1,000 times faster than traditional methods. (IC50 refers to the half-maximal inhibitory concentration, a standard measure of a drug’s potency.)
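Because IC50 values span many orders of magnitude, models trained on potency labels like SAIR’s typically work with the log-transformed pIC50. A minimal sketch of that standard conversion (the function name and example values are illustrative, not taken from SAIR):

```python
import math

def pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar).

    Higher pIC50 means a more potent inhibitor.
    """
    ic50_molar = ic50_nm * 1e-9
    return -math.log10(ic50_molar)

# A 10 nM inhibitor (quite potent) vs. a 10 µM one (weak):
print(pic50(10))      # ≈ 8.0
print(pic50(10_000))  # ≈ 5.0
```

The log transform compresses the nanomolar-to-millimolar range into a roughly 4–9 scale, which is far better behaved as a regression target.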
From structure to function
SandboxAQ by the numbers
- Founded: 2022 (Alphabet spinoff)
- Patents: 18 granted (10 from Good Chemistry Inc. acquisition)
- SAIR dataset: 2.5 TB, 5.2M structures, 1M+ unique pairs
- Compute run for SAIR: 20 days at 90%+ GPU utilization
- Industry adoption: 6 pharma companies in first 48 hours
- Team growth: 75+ in AI Simulation (from handful in 2022)
- Recognition: TIME’s Most Influential Companies 2024, Edison Award Gold
- Investors: Eric Schmidt, Marc Benioff, Yann LeCun, T. Rowe Price
Note: Patent data sourced from Google Patents Public Dataset, analyzing patents published from January 2018 onwards for SandboxAQ and its acquired companies Good Chemistry Inc. and Cryptosense.
The SAIR dataset, which captures more than 1 million protein–ligand systems and is available on Google Cloud Platform, fills a gap that even state-of-the-art structural biology AI models leave open. While platforms like Google DeepMind’s AlphaFold 3 can now predict the structure of complex protein–ligand interactions with high accuracy, they do not inherently predict the potency or binding affinity of that interaction. For drug developers, that is the missing puzzle piece. Harhen explained the historical limitation of tools trained only on existing experimental data: when AlphaFold first emerged, it “covered everything in the Protein Data Bank at the time, but it didn’t really go beyond that. It didn’t cover the ‘undruggable’ world,” she said.
This created a problem: models trained on that data “would not generalize and extrapolate very well,” Harhen noted. SAIR was built to help address this shortfall by providing 5.2 million structures that are not only physics-grounded but are also explicitly labeled with their experimental IC50 potency data. This creates a large, ground-truth resource specifically designed to train next-generation AI models that can connect a molecule’s structure to its ultimate functional effectiveness.
Physics, not just language
The engineering achievement behind SAIR’s creation is as notable as its scope. Working with NVIDIA DGX Cloud, SandboxAQ achieved over 90% GPU use, a figure Harhen calls “unheard of” in the field. This optimization compressed what would typically require three to five years of computation into just 20 days. “Our teams [at SandboxAQ and NVIDIA] worked very, very closely together to achieve those types of statistics in order to create this dataset,” Harhen said. “It’s just incredible what you can do.”

The dataset spans more than 1 million unique protein–ligand systems, each computationally generated with physics-based modeling rather than purely AI-driven predictions. “We didn’t train the algorithm,” Harhen emphasized. “We released the dataset so that people can train their own algorithms based on the area of chemical space that they’re most interested in and fine-tune accordingly.”
This physics-first approach reflects SandboxAQ’s broader technical philosophy, which diverges from the language-model zeitgeist dominating AI headlines. The company’s Large Quantitative Models (LQMs) tackle a fundamentally different problem than their linguistic cousins. “LLMs get all the headlines, and rightfully so,” Harhen acknowledged. “But language is limited, specifically in quantitative applications when it comes to math, sciences and real B2B problems.”
While a language model can reason through some mathematical equations, she notes, “what it can’t do is actually solve multi-parameter optimizations over hundreds and thousands of parameters.” For drug discovery, where success means simultaneously optimizing potency, selectivity, toxicity, solubility and dozens of other properties, this distinction matters. “That is where our technology is inherently different,” Harhen said.
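The multi-property trade-off Harhen describes is often handled in discovery pipelines by scalarizing several predicted properties into a single ranking score. A toy sketch of that idea; the property names, weights, and candidate values are invented for illustration, not drawn from SandboxAQ’s tooling:

```python
# Toy multi-objective ranking: fold several predicted properties into one
# desirability score. All names, values, and weights are illustrative.

candidates = {
    #           potency (pIC50)  selectivity  solubility (logS)  toxicity risk
    "mol_A": {"potency": 8.1, "selectivity": 0.9, "solubility": -3.2, "tox_risk": 0.10},
    "mol_B": {"potency": 7.4, "selectivity": 0.6, "solubility": -1.8, "tox_risk": 0.05},
    "mol_C": {"potency": 8.6, "selectivity": 0.4, "solubility": -5.0, "tox_risk": 0.40},
}

# Weighted sum: reward potency, selectivity, and solubility; penalize toxicity.
WEIGHTS = {"potency": 1.0, "selectivity": 2.0, "solubility": 0.3, "tox_risk": -5.0}

def score(props: dict) -> float:
    return sum(WEIGHTS[k] * v for k, v in props.items())

ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
print(ranked)
```

Note that the most potent candidate (mol_C) does not win: its poor selectivity, solubility, and toxicity drag its combined score below the others, which is exactly the kind of joint optimization a single-property model cannot express.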
“Language is a component of what we use, but the underlying fundamentals are physics-based.” —Harhen
Toward a universal chemistry engine
Early reception suggests the industry was waiting for this kind of resource. Within the first 48 hours of release, six pharma companies had already requested and received access to SAIR. “It’s been met with more interest than we anticipated,” Harhen said. She added that hundreds of researchers signed up for the introductory webinar.
The timing aligns with shifting regulatory attitudes: the FDA recently issued stronger guidance encouraging computational approaches to replace animal testing. “I do think that the FDA is very open to these types of methods now, more than ever before,” Harhen observed. For companies downloading SAIR, the dataset’s physics-based grounding means they can skip traditional validation steps. When asked whether users would need to synthesize molecules from SAIR to confirm predictions through wet lab testing, Harhen was direct: “They could do that, but the idea is that they don’t have to. It’s already grounded in those types of data.”
Beyond pharma
The implications extend well beyond pharma labs. SandboxAQ’s platform treats molecular interactions as a universal computational challenge, whether the target is a protein binding site or a battery electrode. “It’s all interface and chemistry at the end of the day,” Harhen explained. “Catalysts are like enzymes; enzymes are in the body. You could see how many of the tools that we make on the bio side then translate to our chemical simulation work.” The company is already preparing its next dataset release focused on catalysts, with applications spanning from industrial chemistry to clean energy. This cross-pollination reflects SandboxAQ’s focus: “We’re not a therapeutics company, so we don’t pursue things based on a therapeutic area,” Harhen said. “We’re a technology company, so we invest based on the technology as aligned with our customer needs.”