Artificial intelligence (AI) tools, such as ChatGPT, have received acclaim for their potential to reduce clinician workloads. These tools can collect patient histories, triage cases, and even offer preliminary diagnoses. Known as large language models, these technologies are already being used by patients to make sense of their symptoms and interpret medical test results.
Yet despite their impressive performance on standardized medical exams, these AI models struggle with more realistic, interactive patient-doctor conversations, according to a new study by Harvard Medical School and Stanford University researchers. The analysis, published January 2 in Nature Medicine, introduces an evaluation framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and uses it to evaluate four large language models in scenarios resembling authentic patient encounters.
“All four large language models performed well on medical exam-style questions,” the study found, “but their performance declined when involved in conversations that more closely mimic real-world interactions.” This discrepancy, the authors say, highlights a twofold need: first, to create more realistic evaluations for assessing clinical AI tools, and second, to improve the models’ diagnostic abilities during genuine patient-doctor exchanges before deploying them in actual clinical settings.
Evaluation tools like CRAFT-MD, the team notes, not only can more accurately assess AI models for real-world readiness but can also help optimize their clinical performance. “Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations — the need to ask the right questions at the right moment, to piece together scattered information, and to reason through symptoms — presents unique challenges beyond simply answering multiple-choice questions. When we move from standardized tests to these natural conversations, even the most advanced AI models demonstrate significant declines in diagnostic accuracy.”
A Better Test for AI’s Real-World Performance
Developers often evaluate AI models by asking them to answer multiple-choice medical questions derived from national exams and residency certification tests. But this method rarely accounts for the messier reality of real-world patient care.
“This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world, this process is far messier,” said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. “We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform.”
CRAFT-MD aims to meet that need. It evaluates how effectively large language models collect clinical details about symptoms, medications, and family history and then arrive at a diagnosis. An AI agent simulates a patient, providing conversational, natural responses, while another AI agent grades the final diagnosis. Human experts then review each interaction for thoroughness of history-taking, diagnostic accuracy, and adherence to prompts.
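To make the workflow concrete, here is a minimal, hypothetical sketch of how such a conversational evaluation loop could be wired up. It is not the study's actual implementation: the chat() helper, model names, prompts, and turn limit are illustrative assumptions, and the grader here stands in for CRAFT-MD's combination of AI grading and human expert review.

```python
# Illustrative sketch of a CRAFT-MD-style conversational evaluation loop.
# chat() is a placeholder for whatever LLM API is available; the prompts,
# model names, and turn limit are hypothetical, not taken from the study.

def chat(model: str, messages: list[dict]) -> str:
    """Placeholder: send a chat transcript to an LLM and return its reply."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def evaluate_vignette(vignette: str, true_diagnosis: str,
                      doctor_model: str = "model-under-test",
                      max_turns: int = 10) -> bool:
    """Run one simulated patient encounter and grade the final diagnosis."""
    patient_system = (
        "You are a patient. Answer the doctor's questions in plain, "
        f"conversational language, using only these case facts:\n{vignette}"
    )
    doctor_system = (
        "You are a physician. Ask one question at a time to take a history. "
        "When confident, reply with 'FINAL DIAGNOSIS: <diagnosis>'."
    )
    doctor_msgs = [{"role": "system", "content": doctor_system}]
    patient_msgs = [{"role": "system", "content": patient_system}]

    diagnosis = ""
    for _ in range(max_turns):
        # The model being evaluated asks a question or commits to a diagnosis.
        question = chat(doctor_model, doctor_msgs)
        doctor_msgs.append({"role": "assistant", "content": question})
        if "FINAL DIAGNOSIS:" in question:
            diagnosis = question.split("FINAL DIAGNOSIS:", 1)[1].strip()
            break
        # The patient-AI agent answers from the vignette only.
        patient_msgs.append({"role": "user", "content": question})
        answer = chat("patient-agent", patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        doctor_msgs.append({"role": "user", "content": answer})

    # A grader-AI agent checks the diagnosis; in CRAFT-MD, human experts then
    # audit transcripts for history-taking quality and prompt adherence.
    verdict = chat("grader-agent", [{
        "role": "user",
        "content": (f"Ground truth: {true_diagnosis}\n"
                    f"Model diagnosis: {diagnosis}\n"
                    "Do these refer to the same condition? Answer yes or no.")
    }])
    return verdict.strip().lower().startswith("yes")
```

The key point is that the model under test must elicit the case facts itself, turn by turn, rather than receiving a tidy vignette up front.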
Researchers tested four large language models, both commercial and open-source, across 2,000 clinical vignettes spanning 12 medical specialties commonly seen in primary care. While all performed well on structured exam questions, they struggled with more open-ended dialogue. According to the study, the models often failed to ask critical questions, missed key information, and had difficulty synthesizing scattered details, leading to less accurate diagnoses.
Recommendations for Optimizing AI’s Real-World Performance
Based on these findings, the research team proposes several steps for AI developers and regulators:
- Design, train, and test AI tools using conversational, open-ended questions that mirror real-world doctor-patient interactions.
- Evaluate models’ ability to ask crucial questions and extract essential patient information.
- Enhance AI tools so they can follow multi-turn, back-and-forth conversations and integrate scattered pieces of information.
- Develop AI models that can combine textual and non-textual data, such as images and EKGs.
- Develop more advanced AI agents capable of interpreting nonverbal cues like facial expressions, tone, and body language.
The researchers also recommend pairing AI evaluators with human experts. CRAFT-MD, for instance, can process 10,000 conversations within 48 to 72 hours, plus around 15 to 16 hours of expert evaluation, compared with human-based approaches that would require nearly 1,150 hours of simulation and expert analysis. Relying on AI evaluators also avoids exposing real patients to potential harm from unverified tools.
CRAFT-MD itself is expected to be updated over time as improved patient-AI models become available. “As a physician-scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” said study co-senior author Roxana Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions, and thus it helps move the field forward when it comes to testing AI model performance in health care.”
Authorship, Funding, Disclosures
The study (DOI: 10.1038/s41591-024-03328-5) was co-authored by Jaehwan Jeong and Hong-Yu Zhou (Harvard Medical School); Benjamin A. Tran (Georgetown University); Daniel I. Schlessinger (Northwestern University); Shannon Wongvibulsin (University of California, Los Angeles); Leandra A. Barnes, Zhuo Ran Cai, and David Kim (Stanford University); and Eliezer M. Van Allen (Dana-Farber Cancer Institute).
Funding support included the HMS Dean’s Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. Johri received additional support through the IIE Quad Fellowship.