By Ellen Feldman, MD
Synopsis: A controlled trial comparing physician diagnostic performance with and without artificial intelligence (AI) found no significant difference in accuracy, quality, or other metrics, although AI alone outperformed both groups.
Source: Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw Open. 2024;7(10):e2440969.
The enduring complexity of medical diagnosis highlights the necessity of blending a robust scientific knowledge base with sound clinical judgment and skill — an integration that remains essential, even as we explore the potential of tools like artificial intelligence (AI) to support decision-making.1
Large language models (LLMs) represent a remarkable leap in AI, transforming how we interact with and process information. These AI systems are built on deep learning architectures that enable them to generate, analyze, and interpret human-like text with striking sophistication. Trained on vast data sets encompassing everything from literature and news articles to scientific research and social media, LLMs have found applications in fields as diverse as education, law, creative writing, customer service, and medicine. Their ability to adapt to context, generate coherent responses, and handle complex inquiries has made them valuable tools for streamlining workflows and solving problems.2
In medicine, LLMs offer the potential to revolutionize clinical care by enhancing diagnostic accuracy, personalizing patient education and care plans, and supporting evidence-based decision-making.3 For primary care providers juggling high patient volumes and complex medical cases, LLMs hold promise for mitigating cognitive fatigue and expanding diagnostic capabilities. However, as their use grows, so do questions about how best to incorporate these tools into clinical practice while recognizing and addressing their limitations.3,4
This study delves into the intersection of AI innovation and clinical application, exploring whether a commercially available LLM enhances diagnostic reasoning among physicians.
A total of 50 attending and resident physicians trained in general medicine specialties, with a median of three years of practice, were recruited from several large healthcare systems to participate in this controlled study. Participants were randomized into two groups: the control group had access to conventional clinical diagnostic tools (such as UpToDate and/or Google), and the intervention group was given access to a publicly available LLM (GPT-4) in addition to these sources.5 Both groups had one hour to provide a differential diagnosis, outline supporting and opposing factors, and suggest next steps in management for six newly generated clinical vignettes.
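To make the intervention concrete, the minimal sketch below shows roughly how the vignette task might be posed to GPT-4 programmatically through OpenAI's Python client. The prompt wording and vignette text are illustrative assumptions, not the study's actual materials; trial participants accessed the model through its standard chat interface rather than through code.

```python
# Illustrative sketch only: the prompt and vignette below are hypothetical,
# not the study's actual materials. Requires the `openai` package and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical vignette text standing in for one of the study's six cases.
vignette = (
    "A 52-year-old woman presents with two weeks of progressive dyspnea, "
    "pleuritic chest pain, and low-grade fevers..."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting a physician with a diagnostic exercise. "
                "For the case below, provide a ranked differential diagnosis, "
                "list findings that support or oppose each diagnosis, and "
                "suggest next steps in evaluation and management."
            ),
        },
        {"role": "user", "content": vignette},
    ],
)

# The model's free-text response, to be scored against the study's rubric.
print(response.choices[0].message.content)
```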
The study measured diagnostic performance using a standardized rubric assessing accuracy, reasoning quality, and proposed evaluations.
There was no significant difference observed in diagnostic reasoning scores between the AI-assisted group and the control group. Median scores were 76% and 74%, respectively, with an adjusted difference of 2 percentage points (P = 0.60). Additionally, the time spent per case was comparable between the two groups.
However, when the AI was assessed independently, it achieved diagnostic reasoning scores 16 percentage points higher than the control group (P = 0.03). While this finding highlights the model’s intrinsic capabilities and its potential as a powerful standalone resource, it also raises questions about why physician performance was not enhanced when AI was used as an adjunct tool.
Commentary
Diagnostic errors are a significant concern in healthcare, contributing to patient morbidity and mortality.6 The integration of AI into clinical practice has been a subject of considerable interest, particularly regarding the potential to enhance diagnostic accuracy and reduce errors. This randomized trial, designed to assess the impact of AI on physician diagnostic reasoning, represents a step forward in providing evidence about the use of LLMs in everyday clinical settings.
Before exploring the findings of this study in detail, it is important to note that only one LLM, GPT-4 (developed by OpenAI), was used in this research. However, numerous other LLMs, including Gemini (developed by Google), Llama (developed by Meta), and Claude (developed by Anthropic), also are available for commercial use.6-8 These models, and others like them, may have clinical applications in medicine. Future studies in this domain should specify whether their findings are specific to a single LLM or generalize across different AI models, since variability in architecture, training data, and capabilities could significantly influence results.9
As noted, the findings from the Goh et al study demonstrated that AI alone could outperform physicians in diagnostic tasks. However, its use as an adjunct tool did not lead to significant improvement in diagnostic performance when compared to conventional resources. This outcome points toward an important limitation in the integration of AI into clinical workflows.
One plausible explanation for this finding is the lack of familiarity and training among physicians in effectively leveraging LLMs. Without clear documentation of how the AI was used, it remains uncertain whether participants actively incorporated its suggestions into their diagnostic reasoning or simply relied on their usual clinical approach. Notably, participants in this study were not provided with specific guidelines or instructions on how to effectively integrate AI into their diagnostic process. As a result, the potential benefits of the AI tool may not have been fully realized. Future research should explicitly track AI usage patterns to determine how engagement levels influence outcomes.
This points to the importance of structured education and training programs to help clinicians integrate advanced AI technologies into their practice. Merely providing access to sophisticated tools is unlikely to yield optimal results without equipping healthcare professionals with the skills and knowledge needed to use them effectively.3,4
Furthermore, these findings raise questions about the role of AI in clinical decision-making and the best practices for its implementation. Future research should attempt to address several key questions:
- How can clinicians be trained to use LLMs most effectively?
- What types of cases or clinical scenarios benefit most from AI augmentation?
- How can AI tools be integrated into existing clinical workflows to complement providers’ expertise?
This last question is essential. It is important to note that the design of this study involved the use of clinical vignettes, which, while valuable for standardizing assessments, do not fully capture the complexities and nuances of real-world clinical environments. In actual practice, diagnostic reasoning involves dynamic interactions with patients, incomplete data, and evolving clinical scenarios. These factors may pose a particular challenge to LLMs.
Recognizing the limitations of LLMs may be as valuable as learning how to harness their capabilities. LLMs cannot perform physical examinations, interpret nuanced patient histories, or make ethical judgments, all skills that are integral to effective medical practice.3,4 Further studies likely will support a collaborative model in which AI serves as a supportive tool to augment human expertise, with targeted training necessary to optimize this partnership.
While this study did not directly address bias mitigation or data security, both are essential considerations when discussing the use of AI in healthcare. Providers using AI diagnostically must recognize that biases in AI models often arise from the data sets used for training, which may underrepresent certain populations or reflect systematic inequalities. This can lead to disparities in diagnostic accuracy or treatment recommendations. Similarly, safeguarding patient data is paramount, since AI tools often require large amounts of sensitive information to function effectively. Addressing these issues, along with other factors affecting patient care, is vital to building trust and ensuring that AI tools enhance, rather than compromise, patient health.3,4
This study represents a stepping stone in understanding how advanced AI tools fit into the evolving landscape of medicine. The findings suggest that while LLMs hold promise, their integration into clinical practice requires thoughtful implementation, clinician training, and a recognition of their inherent limitations. The question is no longer whether AI belongs in healthcare but how to use its strengths effectively to support clinicians and improve patient outcomes.
Disclaimer: Dr. Feldman discloses that a relative is employed at OpenAI. However, this relationship had no influence on the content, analysis, or conclusions presented in this article. The review is based solely on independent research, scientific evidence, and professional expertise. No financial support, incentives, or input from OpenAI were provided or solicited in the preparation of this work. Dr. Feldman affirms that there are no conflicts of interest affecting the integrity or objectivity of this publication.
Ellen Feldman, MD, works for Altru Health System, Grand Forks, ND.
References
- [No authors listed]. Uncertainty in medicine. Lancet. 2010;375(9727):1666.
- Dwivedi YK, Kshetri N, Hughes L, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inf Manage. 2023;71:102642.
- Alowais SA, Alghamdi SS, Alsuhebany N, et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689.
- Ahsan MM, Luna SA, Siddique Z. Machine-learning-based disease diagnosis: A comprehensive review. Healthcare (Basel). 2022;10(3):541.
- OpenAI. ChatGPT. https://openai.com/chatgpt/overview/
- Wahyudi A. Advanced AI language models: A comparison of ChatGPT, Google, Sparrow, LaMDA, PaLM, GShard, BERT, RoBERTa, GPT-2, T5, and XLNet. Medium. Published April 16, 2023. https://insinyur.medium.com/advanced-ai-language-models-a-comparison-of-chatgpt-google-sparrow-lamda-palm-gshard-bert-e41acbdfb996
- Meta. Llama. https://www.llama.com
- Anthropic. Meet Claude. https://www.anthropic.com/claude
- Budnikov M, Bykova A, Yamshchikov IP. Generalization potential of large language models. Neural Comput Appl. 2024;37:1973-1997.