Large language models encode clinical knowledge
Google (United States) · United States National Library of Medicine · +1 more institution
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model 1…
Citation impact
- FWCI: 502.22
- Percentile: 100%
- References: 91
Authors
32
Topics & keywords
- Computer science
- Benchmark (surveying)
- Language model
- Comprehension
- Artificial intelligence
- Harm
- Key (lock)
- Unified Medical Language System
- Quality Education