While AI-driven breakthroughs are understandably appealing, the underlying data challenges confronting life sciences organizations run deep. According to Deloitte’s 2024 Global Life Sciences Sector Outlook, nearly 40% of potential AI-driven pharma productivity gains are R&D-based. Large pharma companies could save $5–7 billion over five years if they close the AI implementation gap, the report reckons. Yet these gains hinge on having rigorous infrastructure, better data governance, and new approaches to collaboration—precisely the groundwork Papadopoulos insists on before fully embracing AI.
At present, many research teams remain stuck in inefficiency and fragmentation. Only about 16% of drug discovery efforts use AI, even though triple-digit growth is projected. Data scientists often spend up to 80% of their time preparing data rather than analyzing it. Compensation disparities compound the problem: pharmaceutical data scientists average around $124,000 a year according to Glassdoor, while top technology firms pay well beyond $200,000, with total compensation packages approaching $1 million for the most lucrative roles. This environment makes recruiting and retaining top data talent in life sciences challenging, especially since only a handful of computational biologists at large pharma companies can bridge the gap between biology and data engineering.
A data reality check
Only about 16% of drug discovery efforts currently use AI, even though triple-digit growth is projected. —Deloitte [1]
Beyond the talent shortfall, the data itself often defies traditional structures within many life science organizations. As Papadopoulos puts it: “99% of their data is not tabular. Yet 99% of research and solutions out there focus on tables.” A bevy of data science tools, from Pandas in Python (for flexible data manipulation and analysis) to SQL (for structured querying of relational data), Tidyverse in R (for “tidying” data as well as analyzing it), and Tableau (for interactive visual dashboards), are table-centric. While these tools serve varying purposes, they underscore a deep-seated tendency to favor data that can be flattened into rows and columns. The core problem is not the data but inadequate architectures and mindsets. To address this, Papadopoulos proposes three “radical” premises—each setting the stage for a more mature, infrastructure-first approach—and hints at a fourth consideration that naturally follows.
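To make that table-centric tendency concrete, here is a minimal sketch (not from the article) using the Pandas and NumPy stack named above, with a small synthetic 3D imaging volume standing in for real microscopy data: flattened into a voxel-per-row table, the spatial structure survives only as coordinate columns, whereas the array view keeps a crop as a single slice expression.

```python
import numpy as np
import pandas as pd

# A tiny 3D microscopy-style volume: (z, y, x) intensity values.
volume = np.random.default_rng(0).random((4, 64, 64))

# Table-centric view: every voxel becomes a row, and the spatial
# neighborhood structure survives only as three coordinate columns.
z, y, x = np.indices(volume.shape)
df = pd.DataFrame({"z": z.ravel(), "y": y.ravel(), "x": x.ravel(),
                   "intensity": volume.ravel()})
print(df.shape)              # (16384, 4)

# Array-centric view: a spatial crop is a single slice expression.
crop = volume[1:3, 10:20, 10:20]
print(crop.shape)            # (2, 10, 10)
```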
Premise 1: “Don’t touch AI without data infrastructure”
A case in point: Sometimes a calculator is better than an LLM
A recent preprint, “From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting,” highlights how incorporating external tools such as calculators can enhance an LLM’s math problem-solving abilities. Similarly, the “MathChat” framework described in another preprint uses conversational interactions between LLM agents and user proxy agents, leveraging code execution to tackle complex math problems.
“If you ask an LLM to do a division with fractional numbers, it might give the wrong answer because it’s probabilistic, not a processor,” Papadopoulos said. “But if you connect it to a calculator tool, the LLM knows to say: ‘Oh, that’s a math question—I should ask the calculator.’” Not to mention, a calculator is a profoundly more efficient tool than an LLM.
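A minimal sketch of that routing pattern follows. The `ask_llm` parameter is a hypothetical stand-in for a real model API (an assumption, not something from the article); the point is simply that the arithmetic is dispatched to a deterministic calculator rather than predicted token by token.

```python
from fractions import Fraction
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expr: str) -> Fraction:
    """Exact arithmetic on fractions, e.g. '3/7 / 2/5' -> 15/14."""
    left, op, right = expr.split()
    return OPS[op](Fraction(left), Fraction(right))

def answer(question: str, ask_llm=None) -> str:
    """Route math-shaped questions to the calculator; defer the rest to an LLM.
    `ask_llm` is a hypothetical stand-in for a real model API."""
    tokens = question.split()
    if len(tokens) == 3 and tokens[1] in OPS:
        return str(calculator(question))        # deterministic tool call
    return ask_llm(question) if ask_llm else "(would call the LLM here)"

print(answer("3/7 / 2/5"))    # 15/14, computed exactly rather than sampled
```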
When it comes to the bright-shiny-object syndrome afflicting many AI projects, Papadopoulos is blunt: “You shouldn’t be touching AI unless you have a data management infrastructure in place.” Without robust governance, security, cataloging, and unified access controls, AI deployments will only amplify existing data chaos. By contrast, starting with a disciplined, database-centric foundation ensures that when AI is finally introduced, it acts as a force multiplier rather than a source of confusion, or a costly, overengineered layer that delivers only marginally better results than simpler, well-structured machine learning models.
To make AI dreams a reality, Papadopoulos advocates a data-first mentality: “Focus all your efforts on building the best possible data management system. Once you do things manually and securely, then bring in AI to automate tasks.” With strong infrastructure, AI becomes the last piece of the puzzle, seamlessly tapping into well-organized, well-governed data stores. Then, AI systems can function as a natural language interface to complex data ecosystems. Instead of wrangling query languages, scientists can simply converse with the system. “Here’s where AI comes in. AI is not going to give you something that is insane,” Papadopoulos said. “You can interface with the system with a natural language. This is the biggest value of AI for me. It’s going to understand what you’re asking, and then it’s gonna make the queries.”
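One way to picture this natural-language interface, as a sketch under assumptions rather than any vendor’s actual implementation: the model is shown a catalog description and the user’s question and returns a query, which the governed system then executes. Here `ask_llm` is a stub that returns a canned SQL statement, and the schema and data are invented for illustration.

```python
import sqlite3

# Hypothetical stand-in for a real model call: in practice `prompt` would go
# to an LLM, which would return SQL (or a domain-specific query) to execute.
def ask_llm(prompt: str) -> str:
    return ("SELECT sample_id, AVG(expression) FROM assays "
            "WHERE gene = 'TP53' GROUP BY sample_id")

CATALOG = "Table assays(sample_id TEXT, gene TEXT, expression REAL)"

def answer(question: str, conn: sqlite3.Connection):
    prompt = f"Schema: {CATALOG}\nQuestion: {question}\nReturn one SQL query."
    sql = ask_llm(prompt)                  # the model writes the query...
    return conn.execute(sql).fetchall()    # ...the governed database runs it

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assays(sample_id TEXT, gene TEXT, expression REAL)")
conn.executemany("INSERT INTO assays VALUES (?, ?, ?)",
                 [("S1", "TP53", 2.4), ("S1", "TP53", 2.8), ("S2", "TP53", 1.1)])
print(answer("What is the average TP53 expression per sample?", conn))
# e.g. [('S1', 2.6...), ('S2', 1.1)]
```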
Premise 2: “There is no unstructured data”
The idea that certain data types are inherently “unstructured” is a fundamental misconception. Papadopoulos asserts that every dataset, no matter how complex, contains intrinsic patterns:
“There is no unstructured data… White noise may be the closest thing to no structure, but it still follows a uniform distribution.”
Papadopoulos continued: “If you’re keeping [your data] ‘unstructured,’ it’s because you don’t have the right system to structure it, and that’s causing problems.”
Finding order in chaos
“All data—tables, images, RNA, DNA, point clouds, satellite imaging—is essentially an array of values,” Papadopoulos explains. Even encrypted text and seemingly random signals have patterns under the right lens. The challenge: current modeling approaches and SQL-centric tools often flatten multi-dimensional data into rows and columns, stripping away critical context.
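A rough illustration of that framing, with assumed shapes and synthetic values rather than real datasets: each modality below is just an N-dimensional array; only the dimensions and their meaning differ.

```python
import numpy as np

# Every modality is an array of values; only the shape and its meaning differ.
table       = np.zeros((1_000, 12))                      # rows x columns
image_stack = np.zeros((16, 512, 512), dtype=np.uint16)  # z x height x width
rna_counts  = np.zeros((500, 2_000), dtype=np.int32)     # cells x genes
point_cloud = np.zeros((50_000, 3))                      # points x (x, y, z)

for name, arr in [("table", table), ("image_stack", image_stack),
                  ("rna_counts", rna_counts), ("point_cloud", point_cloud)]:
    print(f"{name:12s} {arr.ndim}-D array, shape {arr.shape}, dtype {arr.dtype}")
```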
Implementation challenges
A reliance on table-centric models forces rich genomic, imaging, and clinical data into rigid, two-dimensional schemas. This mismatch leads to lost insights. For instance, genomic sequences demand structures that preserve hierarchical and multi-dimensional relationships, while imaging and clinical data need formats that capture complexity without artificial simplification.
Architectural approaches
Embracing architectures that acknowledge all data as inherently structured—such as multi-dimensional arrays and schema-on-demand designs—enables more nuanced modeling. Domain-specific query languages can reflect scientific workflows rather than forcing everything through a SQL bottleneck.
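As a sketch of what a domain-oriented query can look like against an array-native layout (synthetic data and simplified in-memory arrays; real array databases add on-disk tiling, sparse dimensions, and versioning), the request “these samples over this genomic window” becomes a coordinate slice rather than a join over a flattened table.

```python
import numpy as np

# Toy coverage matrix: samples x genomic positions along one chromosome.
coverage = np.random.default_rng(0).poisson(30, size=(8, 1_000_000))

def region(sample_ids, start, end):
    """Domain-style query: 'these samples over this genomic window'."""
    return coverage[np.array(sample_ids), start:end]

# Three samples over a 10 kb window: one coordinate slice, no joins.
chunk = region([0, 3, 5], 250_000, 260_000)
print(chunk.shape)   # (3, 10000)
```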
Implications for life sciences
Recognizing that no data is truly “unstructured” frees researchers from legacy constraints. Instead of contorting data to fit outdated models, life sciences organizations can let the data’s natural patterns emerge. This shift paves the way for more accurate insights and a foundation on which advanced analytics or AI can thrive.
Premise 3: “No best data practices, only bad ones—for now”
In Papadopoulos’s view, current norms don’t even rise to the level of best practices:
“There are no best practices right now; we only see bad practices.”
As alluded to earlier, he warns that organizations rushing into AI without foundational work are setting themselves up for failure. Instead, they must start from scratch, building a database-centric, secure, and discoverable data ecosystem before turning to AI. Once these fundamentals are in place, AI can serve as a query interface to a unified system, letting researchers engage data through natural language rather than wrestling with fragmented, ad hoc setups.
Charting a path to better data practices
Beyond RAG: Agents, tools, and true data integration
While Retrieval-Augmented Generation (RAG) helps large language models fetch relevant textual snippets—often PDFs—it falls short for complex scientific data. “RAG is only for PDFs, pretty much,” Papadopoulos said. That is, it can’t handle the complexity of, say, multi-dimensional or genomic data.
A broader approach: He advocates employing LLMs as orchestrators that know which specialized tools or databases to consult. Instead of retrieving text alone, the LLM can interact with a unified data infrastructure, query domain-specific APIs, perform computations via calculators, or navigate multi-dimensional arrays. This transforms the LLM into a powerful proxy that can ask the “right questions” of your data ecosystem, extending well beyond simple text look-ups.
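A minimal sketch of the orchestration pattern, with a hypothetical tool registry and an `ask_llm` stub standing in for a real model call (the tool names and return values are invented for illustration): the LLM chooses a tool and its arguments, and the specialized system does the actual work.

```python
import json
from fractions import Fraction

# Hypothetical registry of specialized tools the orchestrator can call on.
TOOLS = {
    "calculator":  lambda a: str(Fraction(a["x"]) / Fraction(a["y"])),
    "variant_db":  lambda a: f"<rows for gene={a['gene']} from the variant store>",
    "image_store": lambda a: f"<slide {a['slide_id']}, region {a['region']}>",
}

def ask_llm(question: str) -> str:
    """Stub for a real model call: in practice the LLM returns a JSON tool
    invocation chosen from the registry above, based on the question."""
    return json.dumps({"tool": "variant_db", "args": {"gene": "BRCA1"}})

def orchestrate(question: str) -> str:
    call = json.loads(ask_llm(question))        # the LLM picks the tool...
    return TOOLS[call["tool"]](call["args"])    # ...the tool does the real work

print(orchestrate("Which BRCA1 variants do we have across the cohort?"))
```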
Once disciplined data management is established, the LLM can function as an orchestrator—tapping into specialized tools, databases, or calculation engines behind the scenes. Scientists need not become API experts; their focus remains on the research questions at hand.
Papadopoulos’s roadmap challenges the status quo. Instead of half-measures, he urges a complete rethinking of data infrastructure, explicitly aligning with foundational principles akin to FAIR (Findable, Accessible, Interoperable, Reusable) from the get-go: “If you start with the database, if you start with discipline since day one—thinking about discoverability, accessibility, and all of that stuff—you’ve got the information security covered,” he explains. By embedding FAIR-like principles into the architecture, organizations ensure that data is not only properly governed and secured but also well-positioned for future interoperability and reusability.
- Start with robust infrastructure: Treat all scientific information—genomic sequences, imaging, clinical data—as inherently structured. “Focus all your efforts to build the best possible data management system,” he says.
- Prioritize governance and security: Authentication, authorization, and auditing must be designed in from day one (see the sketch after this list). “If you start with a database approach and discipline,” Papadopoulos notes, “you’ve got the information security covered.”
- Establish discipline and consistency: Integrate all data sources into a unified, discoverable repository that aligns with research workflows. This makes “anything you want” discoverable.
- Master manual workflows before automating: Understand the patterns and bottlenecks. Only then should you tap AI to automate repetitive tasks. If the LLM doesn’t have a structured system to query, no amount of intelligence will yield meaningful insights.
- Use AI as the final layer: With a solid infrastructure in place, AI becomes a powerful ally—a natural-language interface to a rich data ecosystem. Rather than forced onto messy data, AI works in harmony with it.
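As flagged in the governance and security bullet above, here is a minimal sketch, with illustrative dataset and user names, of designing authorization and auditing into the access path from day one: every read is checked against a catalog entry and logged, whether it succeeds or not.

```python
from datetime import datetime, timezone

# Illustrative catalog: every dataset is registered with an owner and an ACL.
CATALOG = {
    "rnaseq_2024":   {"owner": "genomics", "readers": {"alice", "bob"}},
    "trial_ctx_017": {"owner": "clinical", "readers": {"carol"}},
}
AUDIT_LOG = []

def read_dataset(user: str, name: str) -> str:
    """Authorize, audit, then (notionally) hand back a governed handle."""
    entry = CATALOG.get(name)
    allowed = bool(entry) and user in entry["readers"]
    AUDIT_LOG.append({"ts": datetime.now(timezone.utc).isoformat(),
                      "user": user, "dataset": name, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{user} may not read {name}")
    return f"<handle to {name}>"

print(read_dataset("alice", "rnaseq_2024"))    # permitted, and logged
try:
    read_dataset("alice", "trial_ctx_017")     # denied, and also logged
except PermissionError as err:
    print(err)
print(len(AUDIT_LOG), "audit entries")
```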
In essence, these “radical” premises aren’t about adding complexity; they’re about removing barriers. Once organizations internalize that all data can be structured and properly governed, AI moves from hype to practical accelerator, allowing scientists and researchers to focus on true innovation rather than endless big data wheel spinning.
“Focus all your energy over the next five years, as AI develops, on figuring out your data management story,” Papadopoulos said.
References:
1. Deloitte, “2024 Global Life Sciences Sector Outlook,” May 31, 2024: “Research shows that AI currently accounts for approximately 16% of drug discovery efforts and is predicted to grow by 106% over the next three to five years.”