Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
National Institutes of Health · United States National Library of Medicine · +16 more institutions
Abstract
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6%…
Citation impact
- FWCI: 11.39
- Percentile: 100%
- References: 18
Authors
- 18
Topics & keywords
- Artificial intelligence
- Computer vision
- Computer science
- Precision medicine
- Medicine
- Pathology