Software broke scientific reproducibility. AI hallucinations made it worse. Now the same technology is learning to catch its own mistakes.

Seven out of ten researchers say their work would be impossible without software. More than half write their own code. And among those who do, one in five has had no formal training in software development, according to the UK-based Software Sustainability Institute. This is the gap where silent failures live.

“There’s often a great sense of shame that comes from not having a lot of proficiency in this act of coding.” — Patrick Mineault, neuroscientist and AI researcher, in The Scientist

Imagine a Python script that drops every tenth row. Or a unit conversion that truncates decimals. A query returns undefined, and the pipeline keeps going. The plots render. The numbers look right. Months later, nobody can figure out why the results won’t replicate.

Researchers call these “semantic bugs,” errors in logic or design that don’t cause crashes but yield incorrect results. They’re the most prevalent and damaging type of error in scientific software, and their frequency is increasing. One analysis in Biochemical Pharmacology estimated that at least 25% of scientific discoveries may be false due to software bugs and related issues.
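A minimal sketch of what such a bug can look like, and one way to surface it. The function names (`parse_concentrations`, `guarded`), the sample data, and the 5% threshold are illustrative assumptions, not taken from any real pipeline:

```python
def parse_concentrations(rows):
    """Parse concentration strings, silently skipping any that fail to parse."""
    parsed = []
    for raw in rows:
        try:
            parsed.append(float(raw))
        except ValueError:
            continue  # silent failure: the row vanishes and nothing is logged
    return parsed


def guarded(rows, max_drop_fraction=0.05):
    """Run the same parser, but fail loudly if too many rows disappear."""
    parsed = parse_concentrations(rows)
    dropped = len(rows) - len(parsed)
    if rows and dropped / len(rows) > max_drop_fraction:
        raise ValueError(
            f"{dropped} of {len(rows)} rows dropped during parsing "
            f"(limit is {max_drop_fraction:.0%})"
        )
    return parsed


# Hypothetical input: every tenth value uses a decimal comma instead of a point.
rows = ["1,5" if i % 10 == 0 else "1.5" for i in range(100)]
```

Calling `parse_concentrations(rows)` returns 90 values with no warning, while `guarded(rows)` raises an error that names the number of rows lost. The check is crude, but it turns a quiet data loss into something a person has to look at.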

Such bugs contribute to one of the biggest problems in science: the reproducibility crisis. A 2016 Nature survey of more than 1,500 researchers found that over 70% had tried and failed to reproduce another scientist’s experiments, and more than half had failed to replicate their own.

Software problems are not confined to the lab. Software remains one of the leading causes of medical device recalls, with software-related recalls climbing from roughly 6% in the late 1980s to nearly 20% by the 2010s. And long before generative AI and its hallucinations, software glitches were already corrupting data. In 2006, for instance, structural biologist Geoffrey Chang discovered a bug in a script used to analyze protein structures and retracted five papers from Science and other journals.

“People will not believe in science if we can’t demonstrate that scientific research is reproducible.” — Jay Pujara, Director, Center on Knowledge Graphs, USC Information Sciences Institute

AI-generated code is certainly not immune to hallucinations, but the same technology is starting to catch bugs that humans miss, including its own. An AI assistant generating code in isolation has no way to know whether that code actually works. But connect it to your database, your logs, and your runtime environment, and it can watch the code execute and flag when something doesn’t add up.

Labs have rigorous protocols for physical safety; we know exactly what to do if a freezer fails. But we lack protocols for “digital safety.” Here’s what it looks like when an AI applies safety-log rigor to research code:

Digital ‘safety’ logs and pipeline incidents*

If we treated data pipelines like physical lab equipment, the safety log would look like this:

#	Severity	Incident / Data Anomaly	Status	Automated Action
1	🔴 HIGH	Scrubbing Error: 12% of Control Group dropped during normalization	✅ FIXED	Constraint added to `clean_data.py`; preventing silent drop.
2	🔴 HIGH	Unit Mismatch: ‘conc_ng_ul’ column detected as mg/mL in merge	🔍 TODO	Flagged for review. Suggested fix: Apply `1000x` conversion factor.
3	🟠 MEDIUM	P-Value Underflow: 400 genes return ‘0.0’ instead of scientific notation	✅ FIXED	Precision updated to `float64` in analysis script.
4	🔴 HIGH	Integrity Breach: Sample_ID ‘PAT_001’ maps to two different metadata rows	🔍 TODO	Query blocked. Requires manual distinct selection.

Technical debt

#	Severity	Issue	Impact	Status	Fix Plan
5	🟠 MEDIUM	Hardcoded file paths (`/Users/jdoe/...`)	Non-reproducible	🔍 TODO	Refactor to relative paths or config file.
6	🟡 LOW	Deprecated pandas method `.append()` (removed in pandas 2.0)	Future breakage	🔍 TODO	Auto-rewrite to `pd.concat()`
7	🟡 LOW	SQL query lacking `ORDER BY`	Nondeterministic row order	⏳ WAIT	Low priority, but flagged for deterministic output.

* All hypothetical but based on real-world conditions
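Two of the incidents above (the duplicate `Sample_ID` in row 4 and the unit mismatch in row 2) lend themselves to very small automated checks. The sketch below is one possible shape for them; the function names, the field names, and the sample records are hypothetical:

```python
from collections import defaultdict

NG_PER_UL_PER_MG_PER_ML = 1000.0  # 1 mg/mL equals 1,000 ng/µL


def conflicting_sample_ids(metadata_rows):
    """Return sample IDs that map to more than one distinct metadata record."""
    seen = defaultdict(set)
    for row in metadata_rows:
        # Everything except the ID itself counts as the record's content.
        content = tuple(sorted((k, v) for k, v in row.items() if k != "sample_id"))
        seen[row["sample_id"]].add(content)
    return sorted(sid for sid, records in seen.items() if len(records) > 1)


def to_ng_per_ul(value, unit):
    """Convert a concentration to ng/µL, refusing units it does not recognize."""
    if unit == "ng/uL":
        return value
    if unit == "mg/mL":
        return value * NG_PER_UL_PER_MG_PER_ML
    raise ValueError(f"unrecognized concentration unit: {unit!r}")


# Hypothetical metadata: PAT_001 appears twice with conflicting group labels.
metadata = [
    {"sample_id": "PAT_001", "group": "control", "age": 54},
    {"sample_id": "PAT_001", "group": "treated", "age": 54},
    {"sample_id": "PAT_002", "group": "control", "age": 61},
]
```

Here `conflicting_sample_ids(metadata)` flags `PAT_001`, and `to_ng_per_ul` raises on any unit it has not been told about rather than guessing. Both checks block the pipeline instead of letting an ambiguous value through, which mirrors the "Query blocked" and "Flagged for review" actions in the log.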

AI coding tools that check their own work

The technology enabling this is the Model Context Protocol (MCP), an open standard from Anthropic that lets AI assistants plug directly into databases, logs, and runtime environments. Think of it as USB-C for AI: one universal connector, many tools. Since its launch in November 2024, the protocol has spread quickly through AI tooling, with support from OpenAI, Microsoft and Figma. (Google DeepMind was still pushing its competing “Agent-to-Agent” (A2A) protocol as of mid-2025, though Google Drive and Workspace work with MCP via connectors.) MCP also integrates with the tools scientists use daily, allowing the AI to “see” inside your SQL database, read your CSVs directly, or check your Python environment variables.

Instead of pasting error messages into an LLM and hoping for the best, an MCP-enabled assistant sees your actual environment. It watches queries execute. It reads the console errors you ignore. It catches the undefined value before it corrupts your dataset.

“Code seems to be a cultural blind spot… a bug can easily mask whatever signal is present in the underlying data, producing spurious agreement or spurious disagreement in the final result.”
— David Soergel, “Rampant software errors may undermine scientific results,” F1000Research

The ecosystem is growing fast

A year ago, this infrastructure barely existed. Now it’s starting to look like basic instrumentation for software development. Labs already expect temperature logs for freezers and maintenance records for fume hoods. Execution logs for the code that moves, filters, and aggregates data are becoming just as feasible.

The first step doesn’t require a platform overhaul. Wire an assistant into one database or analysis service, give it permission to read logs and schema, and ask specific questions like “Check my Python script and flag anywhere rows are dropped without a warning.” Or: “Find Sample IDs that appear in sequencing results but are missing from the metadata table.”
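The second question maps onto a standard anti-join. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and rows are assumptions made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE metadata (sample_id TEXT PRIMARY KEY, cohort TEXT);
    CREATE TABLE sequencing_results (sample_id TEXT, read_count INTEGER);
    INSERT INTO metadata VALUES ('PAT_001', 'control'), ('PAT_002', 'treated');
    INSERT INTO sequencing_results VALUES
        ('PAT_001', 120), ('PAT_003', 98), ('PAT_002', 75), ('PAT_004', 101);
    """
)

# Anti-join: keep sequencing rows that have no matching metadata row.
# ORDER BY makes the output deterministic from run to run.
ORPHAN_QUERY = """
    SELECT DISTINCT s.sample_id
    FROM sequencing_results AS s
    LEFT JOIN metadata AS m ON m.sample_id = s.sample_id
    WHERE m.sample_id IS NULL
    ORDER BY s.sample_id
"""

orphans = [row[0] for row in conn.execute(ORPHAN_QUERY)]
print(orphans)
```

An assistant with read access to the schema could propose and run a query like this on its own; the value of writing it down is that the same check can then run on every new batch of results.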

That alone moves silent failures onto the same kind of checklist you already maintain for equipment and safety. A dropped row stops being an invisible footnote in a pipeline and becomes a tracked issue with an owner and a fix.

In a few years, retractions like Chang’s, or recalls rooted in a script bug, may look as dated as running an unlogged freezer. The expectation will be simpler: if code shapes the data behind a result, something should be watching it, asking whether the numbers still make sense.
