Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

Sandmann, Sarah; Riepenhausen, Sarah; Plagwitz, Lucas; Varghese, Julian

doi:10.1038/s41467-024-46411-8

Article | Nature Communications | Mar 6, 2024 | Gold OA

University of Münster

Indexed in: Crossref, DOAJ, PubMed

Abstract

It is likely that individuals are turning to Large Language Models (LLMs) to seek health advice, much like searching for diagnoses on Google. We evaluate the clinical accuracy of GPT-3.5 and GPT-4 for suggesting initial diagnosis, examination steps and treatment of 110 medical cases across diverse clinical disciplines. Moreover, two model configurations of the Llama 2 open-source LLMs are assessed in a sub-study. For benchmarking the diagnostic task, we conduct a naïve Google search for comparison. Overall, GPT-4 performed best, with superior performance over GPT-3.5 for diagnosis and examination and superior performance over Google for diagnosis. Except for treatment, better performance on frequent vs…

Citation impact

Total citations: 184

FWCI: 19.67
Percentile: 100%
References: 21

Authors

4

Topics & keywords

Keywords

Benchmarking
Medical diagnosis
Transparency (behavior)
Health care
Computer science
Medicine
Pathology
Business

UN Sustainable Development Goals

Peace, Justice and strong institutions

Funding

Westfälische Wilhelms-Universität Münster