Article | Nature Communications | Mar 6, 2024 | Gold open access

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

University of Münster

Indexed in: Crossref, DOAJ, PubMed

Abstract

It is likely that individuals are turning to Large Language Models (LLMs) to seek health advice, much like searching for diagnoses on Google. We evaluate the clinical accuracy of GPT-3·5 and GPT-4 in suggesting initial diagnoses, examination steps and treatments for 110 medical cases across diverse clinical disciplines. Moreover, two model configurations of the open-source Llama 2 LLMs are assessed in a sub-study. To benchmark the diagnostic task, we conduct a naïve Google search for comparison. Overall, GPT-4 performed best, outperforming GPT-3·5 on diagnosis and examination and outperforming Google on diagnosis. Except for treatment, better performance on frequent vs…

Citation impact

Total citations: 184
FWCI: 19.67
Percentile: 100%
References: 21
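FWCI (Field-Weighted Citation Impact) compares a paper's citation count with the number of citations expected for comparable publications. The expression below is a sketch of the usual definition, assuming that "comparable" means the same subject field, document type and publication year; providers such as Scopus and OpenAlex differ in the citation window and field classification they use, so exact values are not interchangeable between them. Here $C_{\text{actual}}$ is the number of citations the paper has received and $C_{\text{expected}}$ is the average citation count of comparable publications:

```latex
\mathrm{FWCI} = \frac{C_{\text{actual}}}{C_{\text{expected}}},
\qquad
\mathrm{FWCI} = 1 \;\Longleftrightarrow\; \text{cited exactly at the world average for comparable publications}
```

Read this way, the listed FWCI of 19.67 means the paper has been cited roughly 19.67 times as often as expected for comparable publications.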

Authors: 4

Topics & keywords

Keywords
  • Benchmarking
  • Medical diagnosis
  • Transparency (behavior)
  • Health care
  • Computer science
  • Medicine
  • Pathology
  • Business
UN Sustainable Development Goals
  • Peace, Justice and strong institutions

Funding