The limits of LLMs in context understanding
Before discussing the fact-checking tool, Schauer highlighted a notable limitation of large language models. He shared an example in which he asked LLMs, without the use of search, to cite Han Solo’s last line in Star Wars: Episode IV. “Most LLMs get this question wrong because they don’t really have a context-worthy understanding,” he explained. Providing more context can produce a more accurate response, but even slight changes to the question can lead to different answers. “We retrieved information relevant to the question, in this case, probably the ending of the movie script,” Schauer said. “We augmented our original prompt with that information and instructed the model to generate the answer from it.”
Schauer then introduced RAG (Retrieval-Augmented Generation) as a way to improve this process. RAG is a technique that enhances large language models by combining them with external information retrieval systems. In essence, it works by processing a query, retrieving relevant information from a vector database, augmenting the query with that context, and then using a language model to generate a response based on both its pre-trained knowledge and the retrieved information. Compared to using a large language model alone, the approach improves accuracy and contextual relevance, and it allows the underlying knowledge base to be updated in real time.
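As a rough illustration of that retrieve-augment-generate loop, the sketch below wires the steps together in Python. The `embed`, `vector_search`, and `llm_generate` functions are hypothetical placeholders for whatever embedding model, vector database, and language model an actual system would use.

```python
# Minimal sketch of the retrieve-augment-generate loop described above.
# `embed`, `vector_search`, and `llm_generate` are hypothetical stand-ins
# for a real embedding model, vector database, and language model.

def embed(text: str) -> list[float]:
    # Placeholder: a real system would call an embedding model here.
    return [float(ord(c)) for c in text[:8]]

def vector_search(query_vector: list[float], top_k: int = 3) -> list[str]:
    # Placeholder: a real system would query a vector database here.
    return ["...retrieved passage 1...", "...retrieved passage 2..."]

def llm_generate(prompt: str) -> str:
    # Placeholder: a real system would call a language model here.
    return "...answer grounded in the retrieved passages..."

def rag_answer(question: str) -> str:
    """Retrieve relevant passages, augment the prompt, and generate an answer."""
    passages = vector_search(embed(question))
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate(prompt)

if __name__ == "__main__":
    print(rag_answer("What is Han Solo's last line in Star Wars: Episode IV?"))
```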
The technique doesn’t always work perfectly, however. Schauer noted that RAG has limitations in fact-checking real-world information, as “it doesn’t actually validate how well the document supports the claim.”
Tracing claims through citation trees
To illustrate the potential of RAG for deeper analysis, Schauer introduced the concept of ‘citation trees.’ Imagine a family tree, but instead of individuals, it traces the lineage of scientific ideas. Each paper is like a family member, connected to its ancestors (the papers it cites) and its descendants (the papers that cite it).
Think of it like this: you read a claim in a 2023 paper about a specific protein function. That paper cites a 2015 paper for support. RAG systems can retrieve the relevant section from the 2015 paper. But what if that 2015 paper based its conclusion on findings from a 2002 paper? With the citation tree and RAG, the system can trace the lineage.
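The citation tree itself can be pictured as a simple graph in which each paper points to the papers it cites. The sketch below shows how a claim’s lineage might be walked recursively; the paper names and structure are purely illustrative.

```python
# Illustrative sketch of a citation tree and a recursive trace of a claim's
# lineage. The papers and edges here are hypothetical; a real system would
# build this graph from parsed reference lists.

CITATION_TREE = {
    "Paper 2023": ["Paper 2015"],
    "Paper 2015": ["Paper 2002"],
    "Paper 2002": [],
}

def trace_lineage(paper: str, depth: int = 0) -> None:
    """Walk the cited papers recursively, printing the chain of support."""
    print("  " * depth + paper)
    for cited in CITATION_TREE.get(paper, []):
        trace_lineage(cited, depth + 1)

trace_lineage("Paper 2023")
# Paper 2023
#   Paper 2015
#     Paper 2002
```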
“Our source paper, ‘Tersteeg 2015,’ cites ‘Kokame 2002,’ which in turn cites a paper from 1924. For those not well-versed in scientific literature, the content from 1924 looks vastly different from today’s,” Schauer said. This journey back through time highlights how quickly we can reach the foundational research upon which modern science is built. It also illustrates the challenges of manual fact-checking, as the language and methods used in older papers can be difficult to interpret in a modern context. “As you can imagine, this could get really complicated for a human to do quickly,” Schauer said.
The process can also prove challenging for some RAG systems.
On the pitfalls of getting “Lost in the Middle”
The 2023 paper “Lost in the Middle: How Language Models Use Long Contexts” examines how well language models use information located in the middle of long input sequences, finding a characteristic U-shaped performance curve: accuracy is highest when relevant information appears at the beginning or end of the context and drops when it sits in the middle. The paper reported that GPT-3.5-Turbo’s multi-document QA performance can drop by more than 20% when relevant information is in the middle of the context compared to the beginning or end. The researchers observed the effect across various models, including those designed for long contexts, such as Claude-1.3 (100K) and GPT-3.5-Turbo (16K). They also found that the effect was more pronounced in larger models: 13B and 70B parameter models showed both primacy and recency bias, while 7B models were primarily recency-biased.
In addition to highlighting the limitations of off-the-shelf LLMs, the “Lost in the Middle” paper also served as an example of the weaknesses of some RAG systems. Schauer shared an example in which he gave a RAG system the “Lost in the Middle” paper as input and asked why compute and memory increase quadratically with sequence length. The initially generated answer was unsatisfactory: “The reference to the supporting document was omitted, and the answer didn’t explain why compute memory increased quadratically.”
To improve this, Schauer applied RAG to find a key source that the “Lost in the Middle” paper cited: the seminal “Attention Is All You Need” paper. When the RAG system queried this paper, the model explained in mathematical terms why compute and memory increase quadratically with sequence length.
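Schauer didn’t reproduce the model’s answer, but the underlying argument from “Attention Is All You Need” is well known: self-attention scores every token against every other token, so for a sequence of length n it builds an n-by-n matrix, and both compute and memory therefore grow quadratically with n.

```latex
% Scaled dot-product attention from "Attention Is All You Need":
% Q and K are n x d_k matrices, so QK^T is an n x n matrix of pairwise scores.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
% Building and storing the n x n score matrix costs O(n^2) time and memory,
% which is why requirements grow quadratically with sequence length.
```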
The recursive fact-checking process (illustrated in the flowchart) begins by analyzing the source document and its references. Claims with citations are extracted and transformed into queries. These queries are then used to search the supporting documents, using RAG to generate answers and sentiment analysis to determine if the claim is supported. This iterative process ensures thorough verification and provides full references for transparency.
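A rough sketch of this loop, with hypothetical helper functions standing in for the claim-extraction, retrieval, generation, and support-classification steps, might look like the following:

```python
# Rough sketch of the recursive fact-checking loop described above.
# All helpers are hypothetical stand-ins for the claim-extraction, retrieval,
# generation, and sentiment-analysis components of a real system.

def extract_cited_claims(document: str) -> list[dict]:
    # Placeholder: return claims paired with the documents they cite.
    return [{"claim": "Protein X regulates clotting.", "cited_doc": "Cited paper"}]

def rag_answer_from(doc: str, query: str) -> str:
    # Placeholder: retrieve passages from `doc` and generate an answer.
    return "The cited document discusses protein X and clotting."

def claim_is_supported(claim: str, answer: str) -> bool:
    # Placeholder: sentiment/entailment check of the answer against the claim.
    return True

def fact_check(document: str, depth: int = 0, max_depth: int = 3) -> list[dict]:
    """Verify each cited claim, then recurse into the cited documents."""
    results = []
    if depth >= max_depth:
        return results
    for item in extract_cited_claims(document):
        answer = rag_answer_from(item["cited_doc"], item["claim"])
        supported = claim_is_supported(item["claim"], answer)
        results.append({**item, "supported": supported, "depth": depth})
        # Recurse: check whether the cited document's own claims hold up.
        results.extend(fact_check(item["cited_doc"], depth + 1, max_depth))
    return results

print(fact_check("source_paper"))
```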
Smart document checking is a core function that ensures the RAG process uses the actual content of cited documents, not just their titles or bibliographic entries, to generate answers. “This step is crucial for ensuring the accuracy of the fact-checking process,” Schauer explained. “What we don’t want is for the answer generated by RAG to be based on the titles of the documents it’s citing in support.”
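Schauer didn’t describe how this check is implemented. One simple heuristic, sketched below purely as an assumption, is to discard retrieved passages that look like bibliography entries or bare titles before they reach the generation step.

```python
import re

# Hypothetical heuristic for "smart document checking": reject retrieved
# passages that appear to be reference-list entries or bare titles rather
# than body text from the cited document.

def looks_like_bibliography_entry(passage: str) -> bool:
    # Very short passages, or ones matching "Author, A. (Year)." patterns,
    # are probably citations rather than substantive content.
    if len(passage.split()) < 15:
        return True
    return bool(re.search(r"\(\d{4}\)\.", passage))

def filter_passages(passages: list[str]) -> list[str]:
    """Keep only passages that carry actual document content."""
    return [p for p in passages if not looks_like_bibliography_entry(p)]

passages = [
    "Kokame, K. (2002). Journal of ...",                       # dropped
    "In our experiments, plasma levels of the protein were "   # kept
    "measured across three cohorts and found to correlate with outcome.",
]
print(filter_passages(passages))
```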
Schauer also explained how the tool handles non-textual data like charts and figures. For data that can be represented textually, the system converts it to Markdown. For visual elements like diagrams, it employs computer vision techniques for validation. “When it is an actual chart or a diagram, I will use computer vision to attempt to validate that. Oftentimes, though, they don’t publish the raw data behind the chart, so it’s hard to know if the chart is an accurate representation of the findings in the raw data. If they don’t publish the raw data, then there’s very little I can do.”
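As a purely illustrative sketch of that split, a pipeline might render tables as Markdown for indexing and route charts to a vision check that can only go as far as the published data allows; all of the functions below are hypothetical.

```python
# Hypothetical dispatch for non-textual elements: tables become Markdown text
# that the RAG pipeline can index; charts are routed to a vision check, which
# is limited by whether the raw data behind the chart was published.

def table_to_markdown(rows: list[list[str]]) -> str:
    """Render a table as Markdown so it can be indexed like ordinary text."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def validate_chart(image_path: str, raw_data) -> str:
    # Placeholder for a computer-vision check of the chart.
    if raw_data is None:
        return "raw data not published; chart cannot be validated"
    return "chart compared against published raw data"

print(table_to_markdown([["Group", "n"], ["Control", "42"], ["Treatment", "40"]]))
print(validate_chart("figure_2.png", raw_data=None))
```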
Schauer also addressed the scalability of the tool, stating that it can process an average paper in about five minutes at a minimal cost.
Streamlining fact-checking: From web app to Word plugin
Schauer envisions a future where the recursive fact-checking tool is integrated into researchers’ workflows. He proposed developing an API that could power a Word plugin, allowing users to fact-check selected text or entire documents directly within their writing environment.
“It would be way better to rebuild this as an API so that we could easily make something like a Word plugin. That’s what my QC reviewers and my scientific writers are spending 90% of their day in,” he said. “So why not make it so they can, for instance, select some text and run that checker on that selected text or even the whole document? That feels like a no-brainer.”