Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department
Abstract
Large language models (LLMs) are increasingly proposed as clinical decision support tools. However, their reliability in emergency department (ED) triage remains insufficiently validated. This study aimed to evaluate the performance and limitations of multiple LLMs in triage using a large retrospective dataset.
We conducted a retrospective analysis of 39,375 anonymized patient cases from the ED of AHEPA University General Hospital, Thessaloniki, Greece (June 2024–July 2025), extracted from the hospital’s electronic medical record system. All cases were triaged in real time according to the Emergency Severity Index (ESI) by 25 emergency physicians. In cases of uncertainty, a senior emergency physician was consulted. Seven LLMs (ChatGPT-5 Thinking, ChatGPT-5 Instant, Gemini 2.5, Qwen 3, Grok 4.0, DeepSeek v3.1, and Claude Sonnet 4) were evaluated against the physician-assigned ESI level (reference standard). Outcomes included triage score agreement (quadratic weighted kappa, κw), clinic referral accuracy, and admission prediction. Subgroup analyses were performed by referral clinic and admission outcome. The study was conducted in accordance with TRIPOD-AI reporting guidelines.
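For readers unfamiliar with the agreement metric named above, the following is a minimal sketch of how a quadratic weighted kappa could be computed for ratings on the five ESI levels. It is not the study's analysis code: the function name `quadratic_weighted_kappa`, the `levels` parameter, and the toy ratings are illustrative assumptions, and the weights follow the standard definition w_ij = (i − j)² / (k − 1)².

```python
def quadratic_weighted_kappa(reference, predicted, levels=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic weights for two sets of ordinal ratings.

    `levels` lists the ordered categories (assumed here to be ESI 1-5).
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    index = {level: i for i, level in enumerate(levels)}
    k = len(levels)
    n = len(reference)

    # Observed confusion matrix: rows = reference, columns = predicted.
    observed = [[0] * k for _ in range(k)]
    for ref, pred in zip(reference, predicted):
        observed[index[ref]][index[pred]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one level")
    return 1.0 - weighted_observed / weighted_expected


# Toy data (hypothetical, not taken from the study).
physician_esi = [2, 3, 3, 4, 5, 1, 3, 2]
model_esi = [2, 3, 4, 4, 4, 1, 2, 2]
print(quadratic_weighted_kappa(physician_esi, model_esi))
```

Because the weights grow with the squared distance between levels, a model that assigns ESI 5 to a physician-rated ESI 1 case is penalized far more heavily than one that is off by a single level.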
Citation impact
- FWCI: 34.89
- Percentile: 99%
- References: 0
Authors (12)
- Ioannis Nedos, AHEPA University Hospital
- Sofia-Chrysovalantou Zagalioti, AHEPA University Hospital
- Christos Kofos, AHEPA University Hospital
- Theoni Katsikidou, AHEPA University Hospital
- Dimitra Vellidou, AHEPA University Hospital
Topics & keywords
- Triage
- Emergency department
- Referral
- Retrospective cohort study
- Sonnet
- Medical record
- Peace, Justice and strong institutions