Large Language Models Encode Clinical Knowledge

Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S. Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay Kumar; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry W.; Seneviratne, Martin; Gamble, Paul; Kelly, Christopher B.; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, P.; Agüera y Arcas, Blaise; Webster, Dale A.; Corrado, Greg S.; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomašev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joëlle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek

doi:10.48550/arxiv.2212.13138

Preprint; arXiv (Cornell University); Dec 26, 2022; Green OA

Indexed in: arXiv, DataCite

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers…

Citation impact

Total citations: 259

References: 0

Authors: 30

Topics & keywords

Keywords

Computer science
Benchmark (surveying)
Harm
Artificial intelligence
Key (lock)
Machine learning
Data science
Language model

UN Sustainable Development Goals

Quality Education