Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank
Alion Science and Technology (United States) · Brown University · +4 more institutions
Abstract
The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.
On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P
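The confidence intervals reported above are intervals for a binomial proportion (questions correct out of 149). The abstract does not say which interval method was used, so the following is only a minimal sketch assuming a Wilson score interval; the function name `wilson_interval` is hypothetical, and the result is not expected to match the reported bounds exactly. It uses the one raw count the abstract gives, Bard's 66/149.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Assumption: the source does not state its interval method; Wilson is
    used here purely for illustration.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")

    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Bard: 66 of 149 questions correct (count taken from the abstract).
low, high = wilson_interval(66, 149)
print(f"Bard: {66 / 149:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

Other common choices, such as the exact (Clopper-Pearson) interval, give slightly different bounds for the same counts, which is one reason a recomputed interval may differ from a published one in the last digit.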
Citation impact
- FWCI: 11.46
- Percentile: 100%
- References: 8
Authors (12)
- Rohaid Ali (corresponding author)
  Alion Science and Technology (United States), Brown University
- Oliver Y. Tang (corresponding author)
  Alion Science and Technology (United States), Brown University
- Ian D. Connolly (corresponding author)
  Alion Science and Technology (United States), Massachusetts General Hospital
- Jared Fridley
  Brown University
- John H. Shin
  Massachusetts General Hospital
Topics & keywords
- Neurosurgery
- Medicine
- Odds ratio
- Logistic regression
- Order (exchange)
- Internal medicine
- Surgery