Evaluating large language models on medical evidence summarization

Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G.; Soroush, Ali; Elias, Pierre; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin F.; Weng, Chunhua; Peng, Yifan

doi:10.1038/s41746-023-00896-7

Article · npj Digital Medicine · Aug 24, 2023 · Gold OA

The University of Texas at Austin · Cornell University · +6 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical…

Citation impact

Total citations: 354

FWCI: 12.81
Percentile: 100%
References: 23

Authors: 12

Topics & keywords

Keywords

Automatic summarization
Misinformation
Terminology
Computer science
Harm
Quality (philosophy)
Salient
Natural language processing

UN Sustainable Development Goals

Quality Education