
But what impact might this have on real-world scientific research? “AI will power all steps of R&D&I — including those involving creativity,” declares Arthur D. Little’s recently published Eureka! on Steroids report.
In some ways, AI systems are going far beyond chatbot-level functionality in STEM tasks. "We've actually seen entire workflows automated by AI agents—specifically in software development," said Zoe Huczok, a manager at Arthur D. Little and co-author of the report. Huczok points to Sakana.ai as an example: essentially a 'scientist agent' for R&D. It demonstrates "how to automate the entire software research lifecycle, from hypothesis formulation all the way to publication," Huczok said. "That's impressive, but code is inherently linguistic and self-contained, with little interaction with the physical world."
Real-world R&D is more complex than benchmarks
GPQA Diamond benchmark scores
| Rank | Model | Score (%) |
|---|---|---|
| 1 | o3 (currently unreleased) | 87.7 |
| 2 | o3-mini (high) | 79.7 |
| 3 | o1-preview | 78.3 |
| 4 | o1 | 78.0 |
| 5 | o3-mini (medium) | 76.8 |
| 6 | DeepSeek R1 | 71.5 |
| 7 | o3-mini (low) | 70.6 |
| 8 | Gemini 2.0 Pro (Experimental) | 64.7 |
| 9 | Gemini 2.0 Flash (GA) | 60.1 |
| 10 | o1-mini | 60.0 |
| 11 | Claude 3.5 Sonnet | 59.4 |
| 12 | Gemini 1.5 Pro | 59.1 |
| 13 | xAI Grok-2 | 56.0 |
| 14 | OpenAI GPT-4o | 53.6 |
| 15 | Gemini 2.0 Flash-Lite | 51.5 |
| 16 | Meta Llama 3.1 405B | 51.1 |
| 17 | Gemini 1.5 Flash | 51.0 |
Yet the impressive capabilities of systems like Sakana.ai and scores on benchmarks like GPQA don't tell the whole story. "Overall, I don't see AI completely transforming the nature of R&D," Huczok noted. "The scientific method—coming up with a question, testing it, analyzing the results, and formulating conclusions—will stay the same."
And while off-the-shelf and custom AI systems are getting better at science, “not every AI system has the depth of knowledge required,” Huczok added.
As the Eureka! on Steroids report notes, "There is no blanket model for R&D&I tasks." And the broader impact of AI in research and development hinges not only on raw performance but also on trust. "People don't trust decision-makers who have nothing to lose, let alone machines, to handle decisions affecting their lives or major multi-year, billion-dollar investments—like in a nuclear plant," Huczok said. "So trust and responsibility are reasons AI will augment, not replace."
The rise of the research co-pilot
Despite rapid gains in AI’s scientific and technical abilities, the final responsibility for major decisions still rests squarely with humans. This is especially clear in R&D roles, where organizations see AI amplifying human expertise rather than replacing it.
“I haven’t seen any organizations laying off R&D staff just to delegate work to AI,” Huczok said. “Every one of them is asking, ‘How can we do more with the same number of employees, and how can we upskill them so they can tackle bigger challenges?’”
That stands in contrast to the world of software development, where AI can deliver a sizable productivity boost and the question of 'replacement' is most acute: Big Tech firms report that a sizable share of their code (more than 25% at Google) is now AI-generated. The Arthur D. Little paper stresses that in research, or as it puts it, research, development, and innovation (R&D&I), AI is a tool to enhance researchers' capabilities rather than a substitute for them.

From the Eureka! on Steroids report.
Huczok points out that AI can take on routine or data-intensive tasks—things like literature reviews, experiment design, data analysis, and knowledge management. But guiding the research agenda still demands human judgment. There is clearly a continued need for human-led inquiry: "AI is still very much a productivity tool, not a replacement. AI can plan experiments or summarize findings, but it isn't asking the big questions on its own," Huczok said.
Skin in the R&D game
Takeaways from the Eureka! on Steroids report
1. End-to-end AI in R&D&I: Don't expect just productivity gains; also explore how AI can bolster creative processes.
2. AI orchestration: Don't think of AI as an island. Treat it as glue for disparate workflows while keeping humans in the loop.
3. Maturity gap advantage: In a field where AI adoption is still nascent, don't be afraid to take the plunge.
4. Problem-first AI: Target high-impact challenges rather than chasing trendy tools.
5. LLM productivity boost: Practical fine-tuning, such as Low-Rank Adaptation (LoRA), can bolster productivity (see the first sketch after this list).
6. Specialized models: Bigger isn't always better. Smaller models and Bayesian networks can excel at complex R&D problems (see the second sketch after this list).
7. Don't overlook data: AI as a technology is hot, but it is nothing without data, and the notion of garbage in, garbage out applies doubly to AI applications.
8. Trust through transparency: Mandatory human oversight and clear processes are key in high-stakes research.
9. Talent development race: Upskill teams now; the AI talent shortage is expected to persist through 2030.
10. Inference-as-a-service: Cloud-like disruption is enabling new AI business models.
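To make takeaway 5 concrete, here is a minimal sketch of LoRA fine-tuning with Hugging Face's peft library; the base model, target module, and hyperparameters are illustrative assumptions, not choices taken from the report.

```python
# Minimal LoRA sketch: wrap a small causal LM with low-rank adapters.
# "gpt2", the target module, and all hyperparameters are placeholder choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# Training then proceeds as usual (e.g., with transformers.Trainer); only
# the small adapter matrices are updated, which is what keeps it cheap.
```

Because only the adapter matrices are trained, a team can adapt a general-purpose model to its domain literature or lab data at a fraction of the cost of full fine-tuning, which is the productivity angle this takeaway points to.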
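For takeaway 6, a compact Bayesian network illustrates how a small, interpretable model can answer targeted R&D questions. The sketch below uses the pgmpy library; the variables and probability tables are invented for illustration.

```python
# Toy Bayesian network: batch quality depends on material purity and
# reactor temperature. The structure and numbers are made up for the demo.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Purity", "Quality"), ("Temperature", "Quality")])

cpd_purity = TabularCPD("Purity", 2, [[0.7], [0.3]])        # 0=high, 1=low
cpd_temp = TabularCPD("Temperature", 2, [[0.6], [0.4]])     # 0=in-spec, 1=out
cpd_quality = TabularCPD(
    "Quality", 2,
    [[0.95, 0.60, 0.50, 0.10],   # P(Quality=pass | Purity, Temperature)
     [0.05, 0.40, 0.50, 0.90]],  # P(Quality=fail | Purity, Temperature)
    evidence=["Purity", "Temperature"],
    evidence_card=[2, 2],
)
model.add_cpds(cpd_purity, cpd_temp, cpd_quality)

# Exact inference: how likely is a failed batch if purity is low?
infer = VariableElimination(model)
print(infer.query(["Quality"], evidence={"Purity": 1}))
```

Unlike a large language model, every probability here is inspectable and auditable, which speaks directly to the report's trust-through-transparency theme.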
Even as experimental AI models approach 90% accuracy on graduate-level reasoning tests like GPQA Diamond, their lack of skin in the lab game—detached precision without real-world consequences for failure—keeps human researchers uneasy about ceding high-stakes R&D decisions to machines that can't share accountability.
As Huczok's point about nuclear-plant-scale investments suggests, this plays into the augmentation-versus-replacement dynamic. The inherent need for human oversight highlights one of the biggest challenges in AI adoption: trust. Scientists often want to understand precisely how results are produced; they're uneasy about delegating parts of the scientific method to a model they can't fully interpret—especially if it hallucinates or exhibits bias.
Beyond trust, data emerges as the central factor determining whether AI truly transforms R&D. As the report notes, "Data is the game changer, not algorithms." This means that, even as experimental AI systems begin to outstrip expert capabilities, researchers' ultimate prize isn't the AI they used in their research. It's the data assets they build along the way: durable, curated datasets that outlive any single experiment or model. Many organizations are only now realizing that cleaning, labeling, and securely sharing relevant datasets can be a bigger bottleneck than coding a model.
Uptake of AI in science remains gradual. Currently, “when it comes to specific R&D use cases driven by R&D teams themselves, we’re only seeing two or three proof-of-concept projects per client, each targeting a very specific challenge. They’re nowhere near automating entire workflows,” Huczok noted.
Data security remains a primary concern, with organizations balancing legitimate risks against exaggerated fears. As Huczok clarified: “One big misconception is that any data you enter will train an AI model that then serves other clients. It often requires a lot of education to clarify that professional, paid-for models don’t share your data and don’t continuously retrain on every prompt.”
Beyond security, a botched AI rollout threatens to undermine trust. If an organization adopts AI just because it's trendy and it fails to deliver, scientists can lose confidence and may not try it again; some may dismiss AI entirely. Concerns about bias and 'black box' decision-making also loom large.
That sense of caution extends well beyond the lab. “Management consulting will definitely be impacted by AI, and it’s already happening,” Huczok observed. As more open-source research tasks become automated, consultants may shift from delivering insights to “building trust across organizations and helping navigate radical uncertainty.” According to Huczok, “When you have no data—or when you don’t expect the future to resemble the past—AI can’t help. That’s where consultants remain crucial.” The same principle applies across the research landscape: as AI continues its rapid evolution, expert guidance, human judgment, and deep trust will remain essential to realizing its full potential—and mitigating its risks—in high-stakes R&D.