Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

Ali, Rohaid; Tang, Oliver Y.; Connolly, Ian D.; Sullivan, Patricia L. Zadnik; Shin, John H.; Fridley, Jared; Asaad, Wael F.; Cielo, Deus; Oyelese, Adetokunbo A.; Doberstein, Curtis E.; Gokaslan, Ziya L.; Telfeian, Albert E.

doi:10.1227/neu.0000000000002632

Article · Neurosurgery · Aug 15, 2023 · Closed access

Alion Science and Technology (United States) · Brown University · +5 more institutions

Indexed in: Crossref, PubMed

Abstract

Methods

The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.

Results

ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores between ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.
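
This record does not say how the 95% confidence intervals were computed or how many questions the examination contained. As a rough illustration of the arithmetic behind a score with an interval and a two-group comparison, the sketch below computes a Wilson score interval and a two-sided Fisher exact P value in Python. The total of 500 questions and the correct-answer counts derived from it are assumptions made only for this example, and the Wilson method is one common choice, not necessarily the one the authors used.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (about 95% coverage when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    observed = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# HYPOTHETICAL counts: the record does not state the number of questions.
# 500 is an assumed total; 367/500 = 73.4% and 417/500 = 83.4%.
TOTAL = 500
chatgpt_correct, gpt4_correct = 367, 417

for name, k in (("ChatGPT", chatgpt_correct), ("GPT-4", gpt4_correct)):
    low, high = wilson_interval(k, TOTAL)
    print(f"{name}: {k / TOTAL:.1%} (95% CI {low:.1%} to {high:.1%})")

p_value = fisher_exact_two_sided(
    chatgpt_correct, TOTAL - chatgpt_correct,
    gpt4_correct, TOTAL - gpt4_correct,
)
print(f"Fisher exact P (ChatGPT vs GPT-4, assumed counts): {p_value:.2g}")
```

Different interval methods (Wilson, Clopper-Pearson, normal approximation) give slightly different bounds, so the printed intervals need not match the reported ones to the last decimal place.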

Citation impact

203 total citations

FWCI: 7.35
Percentile: 100%
References: 12

Authors: 12

Topics & keywords

Keywords

Medicine
Neurosurgery
MEDLINE
Medical physics
Radiology

UN Sustainable Development Goals

Quality Education
