Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

Ali, Rohaid; Tang, Oliver Y.; Connolly, Ian D.; Fridley, Jared; Shin, John H.; Sullivan, Patricia L. Zadnik; Cielo, Deus; Oyelese, Adetokunbo A.; Doberstein, Curtis E.; Telfeian, Albert E.; Gokaslan, Ziya L.; Asaad, Wael F.

doi:10.1227/neu.0000000000002551

Article · Neurosurgery · Jun 12, 2023 · Closed access

Alion Science and Technology (United States) · Brown University · +4 more institutions

Indexed in: Crossref, PubMed

Abstract

Methods

The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to assess LLM accuracy. Questions were entered in a single-best-answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess differences in performance by question characteristics.
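As a rough illustration of the χ² comparison named above, the sketch below computes a Pearson χ² statistic for a 2×2 table of correct and incorrect answers. It is not the authors' analysis code. The Bard count (66/149) is taken from the abstract, while the GPT-4 count of 123 is an assumption inferred from the reported 82.6% of 149 questions. The function name `chi2_2x2` is hypothetical, no continuity correction is applied, and the two models are treated as independent samples even though they answered the same questions, which is a simplification.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a nonzero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


TOTAL = 149
BARD_CORRECT = 66    # reported in the abstract
GPT4_CORRECT = 123   # assumption: inferred from 82.6% of 149

stat, p_value = chi2_2x2(
    GPT4_CORRECT, TOTAL - GPT4_CORRECT,
    BARD_CORRECT, TOTAL - BARD_CORRECT,
)
print(f"chi-square = {stat:.2f}, p = {p_value:.2e}")
```

For tables with small expected cell counts, a Fisher exact test (also named in the Methods) would be the more appropriate choice.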

Results

On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P
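The 95% confidence intervals quoted in these results are intervals for a binomial proportion. The sketch below shows one common way to compute such an interval, the Wilson score method. This excerpt does not state which interval method the authors used, so the output may differ slightly from the reported bounds. The function name `wilson_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion, returned as (low, high)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Bard: 66 of 149 questions correct, as reported in the abstract.
low, high = wilson_interval(66, 149)
print(f"Bard accuracy: {66 / 149:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

An exact (Clopper-Pearson) interval is a common alternative and gives somewhat wider bounds at this sample size.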

Citation impact

Total citations: 316

FWCI: 11.46
Percentile: 100%
References: 8

Authors: 12

Topics & keywords

Keywords

Neurosurgery
Medicine
Odds ratio
Logistic regression
Order (exchange)
Internal medicine
Surgery