Study shows LLMs diagnose patients more accurately than physicians

A new study in Science reveals that LLMs can now outperform physicians at diagnosis. The study demonstrated that OpenAI’s o1 model identified the correct or a very close diagnosis in 67% of early ER cases, compared to about 50% to 55% for physicians.

The model was required to complete various tasks such as reading medical profiles and suggesting a diagnosis, choosing next steps or estimating the probability of a specific change in future health. In all of these tasks, the model performed similarly to or outperformed physicians.

LLM achieved perfect clinical reasoning score 98% of the time

In one task, o1 received a perfect clinical reasoning score, based on how well it explained its diagnostic thinking and next steps, in 98% of cases, whereas attending physicians did so only 35% of the time. This suggests that the model may be more consistent in documenting and articulating medical logic than human clinicians under stress.

The researchers used cases from real patients who went to the ER at Beth Israel Deaconess Medical Center, supplying information to the model in increments that mirrored the patient experience: a description of symptoms to the intake nurse, evaluation by a doctor and physician decision-making.

Early in the process, when a patient checks in to the ER, o1 identified an exact or close diagnosis 67% of the time. This was more than 10 percentage points higher than the results of two physicians given the same cases. Later in the care pipeline, the model still outperformed doctors by 2 to 10 percentage points.

Limitations and next steps

OpenAI’s o1 model was released in late 2024, which already makes it an “older” model in the fast-moving industry. Newer models could perform even better on similar tests.

The study was limited to the relatively short stays typical of the ER. It did not evaluate how an LLM would perform with more information, meaning its findings are not directly comparable to the diagnostic process in other settings. In addition, the study used only written case information and did not provide the model with other inputs, such as imaging.

The team is already conducting new experiments using longer-term and broader real-world information. Further tests could help determine if LLMs can improve patient care in the real world across clinical settings.
