Article | Journal of Clinical Medicine | Feb 14, 2026 | Gold OA

Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department

AHEPA University Hospital

Indexed in: Crossref, PubMed

Abstract

Background

Large language models (LLMs) are increasingly proposed as clinical decision support tools. However, their reliability in emergency department (ED) triage remains insufficiently validated. This study aimed to evaluate the performance and limitations of multiple LLMs in triage using a large retrospective dataset.

Methods

We conducted a retrospective analysis of 39,375 anonymized patient cases from the ED of AHEPA University General Hospital, Thessaloniki, Greece (June 2024–July 2025), extracted from the hospital’s electronic medical record system. All cases were triaged in real time according to the Emergency Severity Index (ESI) by 25 emergency physicians. In cases of uncertainty, a senior emergency physician was consulted. Seven LLMs (ChatGPT-5 Thinking, ChatGPT-5 Instant, Gemini 2.5, Qwen 3, Grok 4.0, DeepSeek v3.1, and Claude Sonnet 4) were evaluated against the physician-assigned ESI level (reference standard). Outcomes included triage score agreement (quadratic weighted kappa, κw), clinic referral accuracy, and admission prediction. Subgroup analyses were performed by referral clinic and admission outcome. The study was conducted in accordance with the TRIPOD-AI reporting guidelines.
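For readers unfamiliar with the agreement metric, the quadratic weighted kappa compares the observed disagreement between two raters with the disagreement expected by chance, penalizing each disagreement by the squared distance between the two ordinal levels: κw = 1 − Σ w_ij·O_ij / Σ w_ij·E_ij, with w_ij = (i − j)² / (k − 1)². The abstract does not say how the authors computed κw, so the following is only a minimal sketch of the standard definition, assuming the five ESI levels 1 to 5; the function name and arguments are hypothetical and are not taken from the study.

```python
def quadratic_weighted_kappa(reference, predicted, levels=(1, 2, 3, 4, 5)):
    """Quadratic weighted kappa between two equal-length lists of ordinal ratings.

    `reference` would hold the physician-assigned ESI levels and `predicted`
    the model-assigned levels (illustrative names, not from the study).
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("rating lists must be non-empty and of equal length")

    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(reference)

    # Observed confusion matrix: rows = reference, columns = predicted.
    observed = [[0] * k for _ in range(k)]
    for ref, pred in zip(reference, predicted):
        observed[index[ref]][index[pred]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: 0 on the diagonal, 1 for the largest gap.
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            weighted_observed += weight * observed[i][j]
            # Chance-expected count if the two raters were independent.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both raters used one and the same level throughout: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

Under this definition, identical rating lists give κw = 1, agreement no better than chance gives a value near 0, and systematic reversal gives a negative value. Because of the quadratic weights, a model that is off by one ESI level is penalized far less than one that is off by three or four.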

Citation impact

Total citations: 4
FWCI: 34.89
Percentile: 99%

Authors

12

Topics & keywords

Keywords
  • Triage
  • Emergency department
  • Referral
  • Retrospective cohort study
  • Sonnet
  • Medical record
UN Sustainable Development Goals
  • Peace, Justice and Strong Institutions