
Autoscience distinguishes its achievement from similar assertions by competitor Sakana AI, noting Carl’s higher acceptance rate (3/4 vs. Sakana’s 1/3), passage through meta-review, and submission to standard ICLR workshops rather than a niche venue focused on negative results. Autoscience now reports that work generated by Carl has been accepted as a full-length paper to an ICLR 2025 workshop track. For this specific paper, Autoscience says only minor human edits, limited to citations and formatting, were required.
Peer-review proof of concept
Founder and CEO Eliot Cowan described the earlier workshop streak as “a demonstration that this kind of automated research is possible.” He emphasized the significance: “this is the first time, to my knowledge, that legitimate, end-to-end AI-generated research—with a real hypothesis, executed and tested to scientific standards—passed peer review.” Cowan stressed that Carl had to deliver “legitimate, end-to-end” work for reviewers, spanning hypothesis generation, experiment design and manuscript preparation. Seeking formal review was crucial, he explained: “Putting outputs through peer review builds credibility for the automated research scientists we’re building, just like human scientists build credibility. It also contributes to the scientific community.”
Building on an idea from its initial workshop submissions, an improved version of Carl generated the full-length paper “Investigating Alignment Signals in Initial Token Representations,” accepted to an ICLR 2025 workshop. The paper reports achieving 93.1% accuracy using a probe to predict refusal outcomes based on early token representations.
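The paper’s exact probing setup is not described here, but the general technique is well established: train a simple classifier on a model’s hidden states for the first few tokens of a prompt and check whether it predicts the eventual refusal. Below is a minimal, hypothetical sketch using synthetic vectors with a planted “refusal direction” in place of real LLM activations; the data, dimensions, and accuracy are illustrative, not the paper’s.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: each row plays the role of an
# early-token hidden-state vector; labels mark whether the model
# ultimately refused the prompt. A random linear "refusal
# direction" is planted so the signal is learnable.
rng = np.random.default_rng(0)
n_samples, hidden_dim = 1000, 64

X = rng.normal(size=(n_samples, hidden_dim))
refusal_direction = rng.normal(size=hidden_dim)  # planted signal (assumption)
y = (X @ refusal_direction + rng.normal(scale=2.0, size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The probe itself: plain logistic regression over the
# early-token representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

In a real experiment, `X` would come from the model’s residual-stream activations at the first token positions, and `y` from whether the completed generation was a refusal; the probe’s accuracy then measures how much of the refusal decision is already encoded that early.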
From grad-class hack to autonomous research factory

Carl originated when Cowan, who previously worked on machine learning projects at Alphabet’s X moonshot factory and at MIT, sought to streamline his own research workload while juggling multiple final projects for machine learning seminar courses. The process “felt quite systematic and repetitive, so I thought, why not try to automate this?” Cowan recalled. “I created software to test how well an LLM agent could ideate, read research, implement ideas, and write the paper. It did a surprisingly good job and significantly accelerated one project. That’s when I realized we were very close to having AI systems automate AI research.”
The class tool soon evolved into a vision for an industrial-scale research pipeline. “As a company, Autoscience is focused on figuring out how to condense, say, 10 years of AI research into a single year,” Cowan stated. As he put it: “We’re focused on achieving full autonomy, without human intervention.”
“Once we can do that, we can scale up to potentially hundreds of thousands of automated research scientists building on each other’s work simultaneously. That’s what will unlock truly significant new contributions.”
Humans’ ability to keep up with research is already stretched thin
Rapid output already strains both human attention and traditional peer review. Inside Autoscience, Cowan conceded: “Internally, we’re already running studies, feeding the results back into our corpus. Sometimes we produce more papers than we can even read ourselves. We might soon see research based on prior research that no human has ever read.”
External reviewers may feel similar pressure, yet fully machine-run evaluation still seems premature. “There was a trial of AI-assisted peer review at NeurIPS last December, reportedly quite successful,” Cowan said. “However, I haven’t yet seen a fully automated peer review system that I trust. It should exist, and the systems we’re building might eventually enable that, perhaps by assisting with reproducing results at scale, something human reviewers often lack time for.”
Beyond volume, Carl sometimes tries to strike out on its own. “It’s a very common thing: our models try to connect to cloud compute or collect their own data,” Cowan said. “They really love the idea of going out and collecting new experimental data, but we completely blocked off Carl’s ability to do that for these experiments.”
Human–AI division of labor
For now, Autoscience treats the agent as a junior colleague whose strengths complement human planning. “Humans excel at long-horizon planning, which LLMs currently struggle with. Combining human long-term strategy and recall with an LLM’s ability to process vast amounts of literature seems like the first step toward meaningful progress. Eventually, as the AI systems improve, I anticipate most research will be driven primarily by them,” Cowan said.
Autoscience harbors ambitions beyond automating AI research itself. “The hope is that by drastically accelerating AI research, the resulting AI systems will become powerful tools for assisting in other fields like biomedical sciences, material science, etc.,” he said. If that vision holds, Carl’s ICLR 2025 workshop appearance could become a template for automated discovery across disciplines—provided the community can solve the hard problems of oversight, reproducibility and credit assignment first.
The company’s sprint from grad-class side project to ICLR workshop track acceptance offers a conspicuous data point in the debate over automated science. If the pattern repeats, tomorrow’s “principal investigator” may be made of silicon as well as flesh—and, eventually, operate far beyond today’s labs.