The unicorn biotechnology startup Xaira Therapeutics, co-founded by 2024 Nobel Prize winner David Baker, has released what it claims is the largest public genome-wide Perturb-seq dataset to date. The data is freely available as a 520 GB download and comes with an associated preprint.
Comprising eight million cells, the X-Atlas/Orion dataset dwarfs previous public releases, which typically contained fewer than one million cells. That is a jump of at least eight-fold in the scale of data available for training biological AI models. The dataset also features more than 16,000 unique molecular identifiers (UMIs) per cell, a sequencing depth nearly 10 times greater than that of other major atlases, which enables researchers to detect subtle biological changes caused by an array of genetic perturbations.

This diagram outlines Xaira’s industrialized Perturb-seq platform. The new workflow (B) improves on standard methods (A) by using chemical fixation, cryopreservation, and automation to enable large-scale screens, which are then processed by a computational pipeline (C). Credit: Huang et al., bioRxiv (2025)
“This industrialized platform and the Orion dataset will empower scientists to build more predictive models of complex biology,” said Ci Chu, vice president of early discovery at Xaira and senior author of the preprint, in a statement. He added:
“We believe this will help us better understand disease biology and discover drug targets.”
One of the most significant advances in the data may be Xaira’s underlying method for detecting dose-dependent genetic effects. Rather than treating gene knockdowns as simple on-off switches, as is standard in the field, the company’s researchers found they could track the amount of single guide RNA (sgRNA) in each cell to gauge how much a gene’s activity has been dialed down. This turns what was once a binary readout into a graded spectrum of genetic activity. As the authors state in the preprint, “By treating perturbation as a continuous variable, our approach offers a more refined framework to enhance the predictive power and biological insight of future causal models.” If that holds up, it could substantially improve researchers’ ability to predict how cells will respond to different interventions.
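To make the idea of a continuous perturbation variable concrete, here is a minimal sketch in Python. It is not Xaira's pipeline: the function name `dose_response_slope`, the log transform of the guide counts, and the toy numbers are all assumptions made for illustration. It fits a straight line relating each cell's sgRNA UMI count to the expression of the targeted gene, so that a negative slope indicates stronger knockdown in cells carrying more guide.

```python
import math


def dose_response_slope(sgrna_umis, target_expr):
    """Least-squares slope of target expression against log1p(sgRNA UMIs).

    Hypothetical helper for illustration only; the log1p transform is an
    assumption, not a detail taken from the preprint.
    """
    if len(sgrna_umis) != len(target_expr) or len(sgrna_umis) < 2:
        raise ValueError("need two equal-length sequences with at least 2 cells")

    # Treat guide abundance as a continuous dose rather than present/absent.
    dose = [math.log1p(u) for u in sgrna_umis]
    mean_dose = sum(dose) / len(dose)
    mean_expr = sum(target_expr) / len(target_expr)

    sxx = sum((d - mean_dose) ** 2 for d in dose)
    if sxx == 0:
        raise ValueError("sgRNA counts are constant; slope is undefined")
    sxy = sum((d - mean_dose) * (e - mean_expr) for d, e in zip(dose, target_expr))
    return sxy / sxx


# Made-up toy values: six cells with increasing guide counts and
# decreasing expression of the targeted gene.
sgrna_umis = [0, 1, 3, 7, 15, 31]
target_expr = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]

slope = dose_response_slope(sgrna_umis, target_expr)
print(f"slope per unit of log1p(sgRNA UMIs): {slope:.3f}")
```

A binary analysis would only compare cells with and without a guide; the slope uses the graded information in between, which is the distinction the article describes.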