Large Language Models lack essential metacognition for reliable medical reasoning
UCLouvain · Cliniques Universitaires Saint-Luc
Abstract
Large Language Models have demonstrated expert-level accuracy on medical board examinations, suggesting potential for clinical decision support systems. However, their metacognitive abilities, crucial for medical decision-making, remain largely unexplored. To address this gap, we developed MetaMedQA, a benchmark incorporating confidence scores and metacognitive tasks into multiple-choice medical questions. We evaluated twelve models on dimensions including confidence-based accuracy, missing answer recall, and unknown recall. Despite high accuracy on multiple-choice questions, our study revealed significant metacognitive deficiencies across all tested models. Models consistently failed to recognize their…
Citation impact
- FWCI: 88.87
- Percentile: 100%
- References: 48
Authors: 4
Topics & keywords
- Metacognition
- Recall
- Computer science
- Benchmark (surveying)
- Cognitive psychology
- Inclusion (mineral)
- Psychology
- Artificial intelligence