Large Language Models Encode Clinical Knowledge
Indexed in: arXiv, DataCite
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers…
Citation impact
- Total citations: 259
- FWCI: —
- Percentile: —
- References: 0
Authors: 30
Topics & keywords
Topics
Keywords
- Computer science
- Benchmark (surveying)
- Harm
- Artificial intelligence
- Key (lock)
- Machine learning
- Data science
- Language model
UN Sustainable Development Goals
- Quality Education