How Xaira aims to fuel biology’s ‘ImageNet moment’ with a 521-GB open-source dataset

A heatmap illustrating the biological accuracy of the X-Atlas/Orion dataset. The analysis clustered genes into known functional groups, such as those for ribosome biogenesis (purple) and protein synthesis (red), based solely on their downstream effects, validating the quality of the data. Image Credit: Huang et al., bioRxiv (2025), Figure 3C.

Traditional drug discovery is something of a blend of artisanal craft and engineering. To help push the field toward industrialization, Xaira Therapeutics, the $1 billion AI drug discovery unicorn co-founded by Nobel laureate David Baker, has released X-Atlas/Orion, an 8-million-cell open-source dataset detailing not just whether a genetic change affects a cell, but by how much. At the time of writing, just two weeks after its release, the resource had been downloaded more than 16,451 times.

“We really see the whole field of causal biology modeling as data-bound or data-starved because there’s really not a lot of high-quality perturbation data out there,” said Ci Chu, Ph.D., VP of Early Discovery at Xaira Therapeutics. Perturb-seq combines CRISPR gene editing with single-cell RNA sequencing, allowing researchers to knock down thousands of genes in parallel and observe how each perturbation changes a cell’s behavior, essentially creating a giant cause-and-effect map of cellular biology. “By releasing two genome-wide Perturb-seq datasets, we really hope to jump-start the field,” he said. “We see this as fuel for training the next generation of foundation models [Ed note: Foundation models are AI systems pre-trained on vast amounts of data that can be fine-tuned for specific applications], much like ImageNet was very helpful for the computer vision field.”

Words/sentences can be described as a collection of words with the right order, and cells can be described as a collection of genes. —Chu

Ci Chu, Ph.D.

ImageNet, a dataset of over 14 million labeled images released in 2009, helped lay the groundwork for today’s AI wave by providing researchers with enough high-quality training data to develop increasingly sophisticated computer vision models. Xaira hopes to replicate that success for drug discovery.

The timing aligns with biology’s own ChatGPT moment. Researchers are now applying transformer-based architectures to create “single-cell GPTs (scGPTs)” that can predict cellular behavior much like language models predict text. But to move beyond simple correlation and understand causation, these models need to be trained on interventional data showing what happens when a system is actively changed.

“All of us are in the infancy of this field,” Chu said. He noted that answering foundational questions about model scaling, optimal data types, and scalable data generation will be crucial. “The quicker we get through some of the fundamental questions, the faster all of us benefit, including ourselves.”

Is there a scaling law in training biology models? What type of data is the most useful for improving model performance? How do we best scale data generation? —Chu

The stakes are high: many diseases remain “idiopathic,” meaning their causes are unknown. Even for well-studied diseases, the underlying causal mechanisms can be incredibly complex, and current models often struggle to move beyond simple correlations to predict how a cell will actually respond to a specific drug or genetic change.

“Traditionally, target discovery is really smart scientists reading a lot of literature and placing, hopefully, educated guesses or smart bets on what they can physically test in a lab,” Chu explained. Xaira’s approach aims to turn that artisanal process into an engineering discipline. Instead of going directly to the lab, researchers can first perform in silico queries, using the AI model to virtually screen thousands of perturbations. Only then would the top predictions move to physical testing. “You would only test the experiments on the most promising hypotheses that come out of this virtual screening process,” he said, which could ultimately “de-risk the success” of drug development.
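The screen-then-test workflow Chu describes can be sketched in a few lines of Python. This is only an illustration: the target names, the predicted scores, and the `select_for_lab` helper are hypothetical stand-ins, not Xaira's model or pipeline.

```python
def select_for_lab(predicted_effect, top_k=2):
    """Rank candidate perturbations by the size of their predicted effect
    and return the names of the top_k candidates for physical testing."""
    ranked = sorted(
        predicted_effect.items(),
        key=lambda item: abs(item[1]),  # magnitude matters, not direction
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical model outputs: predicted shift in some disease-relevant readout
# for each virtual perturbation (invented numbers for illustration only).
predicted_effect = {
    "target_1": 0.05,
    "target_2": -0.80,
    "target_3": 0.45,
    "target_4": -0.10,
}

shortlist = select_for_lab(predicted_effect)
print(shortlist)
```

In practice the scores would come from a trained perturbation-prediction model, and the shortlist size would depend on how many experiments a lab can run.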

While the processed X-Atlas/Orion datasets on Figshare clock in at just over 521 GB, the raw sequencing data was even larger. “We know that the complete raw data, data straight out of a sequencer, would be actually pretty challenging for most academic labs to process,” Chu explained. “So, to help the field, we released the pre-processed data that’s all the cells and all the gene expression of all the perturbations so that it’s much easier to take it to downstream analysis.”

An industrialized platform to ‘fix’ Perturb-seq

Generating a dataset of this size and quality required Xaira to overcome significant technical hurdles that have historically constrained Perturb-seq experiments. The solution was an industrialized platform the company calls FiCS Perturb-seq (Fix-Cryopreserve-scRNA-seq). The name, Chu said, is fitting: “It’s a cute acronym that sounds like ‘fix,’ which is exactly one of the key points” of the platform, designed to eliminate scalability bottlenecks and reduce technical noise.

A common challenge with traditional fresh-cell protocols is that the processing itself can stress the cells, creating artifacts that confound the biological signals. For a genome-wide experiment involving hundreds of millions of cells, this processing can take hours. “The cells you get out at the end, you can imagine how stressed they are, or they might even be on the verge of cell death,” Chu explained. “All of that stress-induced signal… will confound your downstream analysis.”

To address this hurdle, Xaira’s platform uses a chemical fixation step that takes a “snapshot in time of the cell’s transcriptome,” making the cells immune to downstream handling. This technique, combined with cryopreservation and automation, allows researchers to decouple cell harvesting from analysis.

Beyond binary: Capturing dose-dependent effects

Another technical advance in X-Atlas/Orion is its ability to measure not just whether a gene has been knocked down, but by how much. That insight has significant ramifications for drug discovery where, as Chu emphasized, “dose is everything.”

Traditional Perturb-seq analyses treat genetic perturbations as binary events: either a gene is knocked down or it isn’t. But Xaira’s deep sequencing revealed substantial variability in knockdown efficiency across cells receiving the same perturbation. The breakthrough came when researchers discovered they could use sgRNA (single guide RNA) abundance as a reliable proxy for knockdown strength.

“We found that there’s a very positive correlation between the guide RNA expression level and the knockdown efficiency of its target gene,” Chu explained. “You can use the guide RNA expression level to filter out cells with poor knockdown. You can also use it to stratify cells with high knockdown and low knockdown, and that lets you create a reliable proxy measure to inform dose-dependent effects of genetic inhibition.”
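As a rough sketch of the filtering and stratification Chu describes, the Python below drops cells with weak guide signal and splits the rest into low and high bins by sgRNA count. The field names, the `min_guide_umis` threshold, the median split, and the toy values are all assumptions made for illustration, not the method used in the paper.

```python
from statistics import mean, median


def stratify_by_guide(cells, min_guide_umis=2):
    """Filter out cells with too little guide signal, then split the
    remaining cells at the median guide count into low and high bins."""
    kept = [c for c in cells if c["guide_umis"] >= min_guide_umis]
    if not kept:
        return {"low": [], "high": []}
    cutoff = median(c["guide_umis"] for c in kept)
    low = [c for c in kept if c["guide_umis"] <= cutoff]
    high = [c for c in kept if c["guide_umis"] > cutoff]
    return {"low": low, "high": high}


# Toy cells that all received the same hypothetical guide. "target_expr" is
# the target gene's expression relative to control (invented values).
cells = [
    {"guide_umis": 1, "target_expr": 0.95},
    {"guide_umis": 3, "target_expr": 0.70},
    {"guide_umis": 4, "target_expr": 0.65},
    {"guide_umis": 9, "target_expr": 0.30},
    {"guide_umis": 12, "target_expr": 0.20},
]

bins = stratify_by_guide(cells)
mean_expr = {
    name: mean(c["target_expr"] for c in group)
    for name, group in bins.items()
}
print(mean_expr)
```

With these toy values, the high-guide bin shows lower average target expression than the low-guide bin, which is the pattern a dose-like proxy would rely on.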

This shift from binary to continuous measurement opens new possibilities for understanding complex biological phenomena like haploinsufficiency, where losing just one copy of a gene causes disease, and for predicting how cells will respond to different drug concentrations.

Recovering known biology and uncovering new connections

The biological validity of X-Atlas/Orion was confirmed through its ability to recover known cellular pathways and protein complexes. When researchers compared their transcriptomic networks against StringDB, a curated database of known protein interactions, they found that “StringDB scores significantly increased with transcriptomic correlation strength. That suggests a strong concordance between networks derived from Perturb-seq correlations and known protein interaction networks.”

The dataset successfully identified functional clusters that align with established biology: genes involved in ribosome biogenesis grouped together, including “components of RNA polymerase I for rRNA transcription (POLR1G, POLR1E); factors involved in pre-rRNA processing (NOL12, ISG20L2); [and] proteins essential for nucleolar ribosome maturation (WDR75, PDCD11, URB1, URB2, DDX27, NOC3L).” Other recovered clusters included “the exosome complex, which is responsible for RNA degradation; the Mediator complex, the nuclear pore complex, and the eIF2B complex and aminoacyl-tRNA synthetases.”
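The idea behind this kind of clustering can be illustrated with a minimal Python sketch: perturbations whose downstream expression changes are strongly correlated are treated as functionally related. The knockdown names, the profile values, and the 0.8 threshold below are hypothetical, and plain Pearson correlation is used as a stand-in for whatever similarity measure the authors applied.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(profiles, threshold=0.8):
    """Return pairs of perturbations whose effect profiles correlate
    at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy effect profiles: change in expression of four readout genes after
# each hypothetical knockdown (invented values for illustration).
profiles = {
    "KD_A": [-2.0, -1.5, 0.1, 0.0],
    "KD_B": [-1.8, -1.6, 0.0, 0.2],
    "KD_C": [0.1, 2.0, -0.1, 1.9],
}

pairs = correlated_pairs(profiles)
print(pairs)
```

Here only KD_A and KD_B pass the threshold, so they would land in the same cluster. A comparison like the StringDB analysis would then ask whether such pairs also score highly in a database of known interactions.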

One intriguing finding: “perturbations of mitochondrial proteins (TIMM44, PHB1, SAMM50, and PRELID1) clustered with those of the eIF2B complex members.” That result is consistent with recent reports linking mitochondrial injury to stress response pathways.

The next generation of virtual cells

The X-Atlas/Orion release marks just the beginning of Xaira’s vision for industrialized drug discovery. With the recent addition of Bo Wang, Ph.D., the pioneer behind scGPT, one of the first single-cell foundation models, the company is poised to leverage this data for building next-generation AI systems. “With Bo Wang joining, you can imagine our immediate next step is to start building our virtual cell models or foundation models of biology and to hopefully have really accurate perturbation prediction results,” Chu said.

Beyond expanding the dataset, Xaira plans to move from immortalized cell lines toward more disease-relevant systems. “We hope to transition increasingly away from these transformed cell line experiments into more physiological systems that mimic what happens in a disease context,” Chu explained. “So these are iPSC-derived systems, primary cells, or even direct in vivo screening experiments.”

The company is also positioning itself as a platform partner for the broader drug discovery ecosystem. While “the dataset that we put out there is free for the academic world, with a non-commercial license,” Chu noted that Xaira is “happy to work with any commercial entity who might be interested in collaborating with us.”
