A Springer Nature webinar traces the post-COVID evolution from keyword searches to agentic workflows

When the COVID-19 pandemic hit in early 2020, it created a bottleneck in scientific publishing. Roughly 30,000 articles on the virus were ultimately published, but between 400,000 and 600,000 manuscripts were submitted for peer review across publishers, according to a recent Springer Nature webinar. Researchers trying to identify symptoms, find candidates for drug repurposing, or trace transmission pathways were drowning in unstructured text.

That crisis, and the brute-force AI triage it later helped seed, set the stage for a broader transformation now underway in pharmaceutical R&D. During a September 2025 webinar hosted by Springer Nature, Prathik Roy, Ph.D., the publisher’s Product Director of Data Solutions and Strategy, laid out how the industry is moving from passive literature searches toward autonomous AI systems that can extract, synthesize, and act on scientific data in real time.

The shift is far from proven at scale, and the companies building these systems are still wrestling with fundamental questions about data rights, security, and whether AI can actually improve the pharmaceutical industry’s stubbornly high clinical failure rate.

From keyword search to entity extraction

For decades, pharmaceutical researchers relied on database products and tools like Google Scholar to surface relevant literature, a function Roy describes as basic “discoverability.” AI was involved, but mostly in rudimentary ways, such as ranking search results and flagging related papers.

That changed as organizations began extracting specific entities from full-text articles (genes, proteins, candidate drugs, chemical compounds) and mapping relationships across disparate data sources, including clinical trials, patents, and electronic health records. Roy pointed to use cases where partners are linking symptoms to diseases to drug targets by pulling structured data from multiple source types simultaneously.

“That marked a shift from simple discovery-based solutions to actual extraction, enrichment, and prescriptive analytics,” Roy said.

DeepMind’s AlphaFold, which mined both protein databases and scientific literature to predict protein structures, was an early and high-profile example of this approach. But the underlying capability, extracting machine-readable entities from unstructured scientific text, is now widespread across pharma R&D.
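
To make the extraction step concrete, here is a minimal sketch of dictionary-based entity tagging in Python. The lexicon, the entity labels, and the sample sentence are illustrative assumptions, not anything Roy presented; production pipelines typically rely on trained biomedical entity-recognition models and curated ontologies rather than a hand-built term list.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon for illustration only; real systems draw on
# curated ontologies and trained named-entity-recognition models.
LEXICON = {
    "EGFR": "gene",
    "HER2": "gene",
    "gefitinib": "drug",
    "trastuzumab": "drug",
    "lung cancer": "disease",
}


@dataclass(frozen=True)
class Entity:
    text: str    # surface form as it appears in the source text
    label: str   # entity type taken from the lexicon
    start: int   # character offset where the match begins
    end: int     # character offset just past the match


def extract_entities(text: str) -> list[Entity]:
    """Return lexicon terms found in text, ordered by position."""
    found = []
    for term, label in LEXICON.items():
        # Word boundaries keep a short term from matching inside a longer word.
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append(Entity(match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda e: e.start)


sentence = "Gefitinib targets EGFR in some patients with lung cancer."
for entity in extract_entities(sentence):
    print(entity.label, entity.text, entity.start, entity.end)
```

The character offsets are what make the output machine-readable: each tagged entity can be traced back to its exact position in the article and joined against records from other sources.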

The data feed is feeling its age

To power these more sophisticated applications, the underlying infrastructure is changing too. Historically, publishers and data providers delivered content through bulk “data feeds,” massive file dumps onto servers or cloud storage that required extensive internal processing before the data could be used.

Roy argued that model is becoming obsolete. “Data feeds are increasingly going out of fashion,” he said. “The new standard is using APIs equipped with an MCP layer.”

MCP, the Model Context Protocol, originally released by Anthropic in late 2024, has seen rapid adoption across the tech industry. By mid-2025, more than 5,000 active MCP servers had emerged, and the protocol was donated to the Linux Foundation’s Agentic AI Foundation in December 2025 with backing from OpenAI, Google, Microsoft, and Amazon. The premise is straightforward: MCP provides a standardized way for AI agents to query external data sources in real time, rather than relying on pre-loaded datasets. (For a deeper look at MCP’s implications for regulated industries, see the IntuitionLabs analysis.) That said, the protocol is barely a year old, and security researchers have flagged concerns including token theft, prompt injection, and overbroad permission scopes, risks that are especially acute in regulated pharma environments.

For pharmaceutical applications, the appeal is obvious. A drug discovery agent could simultaneously query a clinical trials database, a publisher’s API, a patent repository, and internal electronic health records, all through a common protocol, to generate a report on, say, human epidermal growth factor receptor research. Companies like Causaly, which launched its agentic AI research platform for life sciences in September 2025, are building exactly this kind of multi-source orchestration. Microsoft has pursued a parallel track with its AI Diagnostic Orchestrator and Healthcare Agent Orchestrator, which coordinate specialized agents across imaging, pathology, and clinical data to support cancer tumor boards at institutions including Stanford and Johns Hopkins.
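
As a rough illustration of what “a common protocol” buys, the Python sketch below builds the same kind of request for four different sources. MCP messages follow JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and arguments. Everything else here is an assumption made for illustration: the source labels, the tool names, and the argument fields are hypothetical, and transport, session initialization, and authentication are left out entirely.

```python
import json
from itertools import count

# Hypothetical sources and tool names, invented for this sketch. A real MCP
# server advertises its own tools, which a client discovers at runtime.
SOURCES = {
    "clinical_trials": "search_trials",
    "publisher_api": "search_articles",
    "patents": "search_patents",
    "internal_ehr": "search_records",
}

_request_ids = count(1)  # each JSON-RPC request carries its own id


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request in the shape MCP uses for tool calls."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def plan_queries(topic: str) -> dict[str, dict]:
    """Prepare one request per source; only the tool name differs."""
    return {
        source: build_tool_call(tool, {"query": topic})
        for source, tool in SOURCES.items()
    }


planned = plan_queries("human epidermal growth factor receptor")
print(json.dumps(planned["patents"], indent=2))
```

The point of the sketch is the uniformity: the agent does not need a bespoke client for each database, only a tool name and a set of arguments.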

Springer Nature currently offers an API layer for its open access content, with MCP support coming soon, and is building toward broader coverage.

Roy described the three prerequisites for this kind of agentic workflow: “You need the legal rights, the data, and an API built with an MCP layer to support agentic workflows.”

A reality check on AI in drug discovery

The vision is compelling, but the broader track record of AI in drug discovery counsels caution.

A 2025 review published in Drug Target Review described the year as “not a revolution but a sobering reality check,” noting that while AI can compress early discovery timelines by 30 to 40 percent, it has not demonstrably improved the industry’s roughly 90 percent clinical failure rate. AI-discovered compounds appear to progress through clinical development at rates similar to traditionally discovered ones. Multiple AI-designed drugs were deprioritized or shelved after Phase II trials showed no efficacy signal.

The most notable success is Insilico Medicine’s rentosertib, one of the first drugs with both target and molecule designed entirely by AI, which completed Phase IIa trials and showed improvement in lung function for idiopathic pulmonary fibrosis patients. But that remains an outlier. Recursion, one of the highest-profile AI drug discovery companies, shelved three prospective drugs in 2025 following its merger with Exscientia, and none of its AI-discovered compounds has reached the market as an approved drug.

In other words, AI is proving most useful in the earliest stages of the pipeline (target identification, lead optimization, and literature mining) rather than in solving the fundamental biological challenges that cause drugs to fail in clinical trials.
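
The arithmetic behind that conclusion can be sketched in a few lines of Python. The 90 percent failure rate and the 30 to 40 percent compression come from the review cited above; the five-year baseline for early discovery is a hypothetical placeholder, and the model assumes every candidate has the same independent chance of success, which is a simplification.

```python
def candidates_per_approval(failure_pct: int) -> float:
    """Expected clinical candidates needed for one approval,
    assuming identical, independent success odds per candidate."""
    return 100 / (100 - failure_pct)


def compressed_years(baseline_years: float, compression_pct: int) -> float:
    """Early-discovery duration after a percentage reduction."""
    return baseline_years * (100 - compression_pct) / 100


BASELINE_YEARS = 5.0  # hypothetical early-discovery duration, not from the review

for pct in (30, 40):
    print(f"{pct}% faster: {compressed_years(BASELINE_YEARS, pct):.1f} years")

# Faster discovery shortens the front of the pipeline, but the number of
# candidates that must enter trials per approval depends only on the failure rate.
print(f"candidates per approval: {candidates_per_approval(90):.0f}")
```

Under these assumptions, speeding up discovery changes how quickly candidates arrive at the clinic, not how many of them survive it.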

The PDF paradox

One of the more unexpected developments Roy highlighted is the rehabilitation of the PDF. For years, PDFs were considered essentially useless for machine reading: anyone who has tried to copy structured data from a PDF knows the pain. Publishers invested heavily in creating machine-readable alternatives like structured XML and JSON.

But large language models have upended that assumption. “LLMs are now starting to work with PDFs much better,” Roy said, describing it as “life coming full circle.” This has practical implications: organizations sitting on archives of PDF-format research can now extract useful data from them without waiting for publishers to deliver structured alternatives.

Security in a regulated industry

For pharma companies dealing with proprietary drug targets and patient data, the security question looms large. Roy emphasized two non-negotiable requirements for any organization deploying AI on sensitive scientific data: a self-hosted environment that does not send data back to the model provider, and a zero-day retention policy ensuring the LLM is not training on proprietary inputs.

“If you are working with LLMs, especially if you’re looking to utilize one of the large enterprise AI models out there, ensure that the one that you have has a zero-day retention policy,” Roy said.

He also noted that publisher licensing agreements typically prohibit use of licensed content with external-facing public LLMs, a compliance consideration that many organizations may be underestimating as they experiment with AI tools.
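
One way to make these requirements operational is a pre-flight check that runs before any licensed or proprietary text reaches a model. The Python sketch below is an illustration under stated assumptions: the `Deployment` fields and the wording of each rule are hypothetical and are not drawn from any vendor’s configuration schema or from a specific licensing agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of how an LLM is deployed."""
    self_hosted: bool        # inputs never leave the organization's environment
    retention_days: int      # how long the provider keeps submitted inputs
    trains_on_inputs: bool   # whether inputs may be used for model training
    public_facing: bool      # an external, publicly accessible LLM service


def policy_violations(deployment: Deployment, uses_licensed_content: bool) -> list[str]:
    """List the reasons a deployment fails the checks described above."""
    problems = []
    if not deployment.self_hosted:
        problems.append("data would be sent back to the model provider")
    if deployment.retention_days > 0:
        problems.append("provider retains inputs")
    if deployment.trains_on_inputs:
        problems.append("provider may train on proprietary inputs")
    if uses_licensed_content and deployment.public_facing:
        problems.append("licensed content used with a public LLM")
    return problems


candidate = Deployment(self_hosted=True, retention_days=30,
                       trains_on_inputs=False, public_facing=False)
print(policy_violations(candidate, uses_licensed_content=True))
```

An empty list means the deployment passes every check; anything else is a reason to stop before data is sent.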

Humans in the loop, still

Roy was candid about AI’s limitations, even while making the case for his company’s data products. He described a paradigm shift from “AI assisting humans” to “humans assisting AI,” meaning that human curation and oversight are now critical checkpoints within AI-driven workflows rather than upstream inputs.

“AI is not foolproof,” he said. “There are ethical considerations, hallucination risks, and reproducibility issues. Never discount the value of human curation and oversight.”

He also offered a glimpse of the unexpected places where scientific literature is finding new applications. Financial market participants, he noted, are now mining research publications to see where specific chemicals and drugs are being written about, in an attempt to predict how pharmaceutical stock prices might move. It is, he said, “a completely different example to what we are talking about today, but the possibilities are endless.”

Whether those possibilities translate into approved drugs reaching patients faster remains the open question. The infrastructure is evolving rapidly. The biology, as ever, has its own timeline.

Prathik Roy, Ph.D., presented “Accelerating Drug Discovery through SN Data APIs” during a Springer Nature webinar on September 17, 2025. For more information on Springer Nature’s text and data mining offerings, visit dev.springernature.com or springernature.com/text-and-data-mining.

