Preprint | arXiv (Cornell University) | Dec 26, 2022 | Green OA

Large Language Models Encode Clinical Knowledge

Indexed in: arXiv, DataCite

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers…

Citation impact

259 total citations
References: 0

Authors

30

Topics & keywords

Keywords
  • Computer science
  • Benchmark (surveying)
  • Harm
  • Artificial intelligence
  • Key (lock)
  • Machine learning
  • Data science
  • Language model
UN Sustainable Development Goals
  • Quality Education