articleThe American Journal of Emergency MedicineJan 20, 2026HYBRID OA

Empowering front-line physicians with AI: Evaluating large language models in everyday ENT care

Sheba Medical Center · Children's National · +3 more institutions

PubMed
Indexed incrossrefpubmed

Abstract

Methods

Twelve clinical vignettes representing routine and urgent presentations were developed and validated by otolaryngologists. One hundred practicing physicians in family medicine and emergency medicine, including residents and attending physicians, completed all vignettes by providing a diagnosis, management plan, and referral decision. Four large language models (Gemini-2.0, ChatGPT-4.0, ChatGPT-5, and OpenEvidence) were tested using identical prompts. Model outputs were anonymized, randomized, and rated by a blinded expert panel using the Quality Analysis of Medical Artificial Intelligence tool, which assesses accuracy, clarity, completeness, sourcing, relevance, and usefulness.

Results

Physicians achieved mean diagnostic accuracy of 91.6% and management accuracy of 87.9%. In non-urgent cases, 30.4% of responses represented inappropriate referral. Only half recognized the need for urgent referral in a cerebrospinal fluid leak scenario. Large language models demonstrated comparable diagnostic and management accuracy with higher referral appropriateness.

No related works found for this paper.