Article · EBioMedicine · Aug 22, 2023 · Gold OA

Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

National University of Singapore · National University Health System · +4 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic about which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.

Methods

We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. Responses rated 'good' were further evaluated for comprehensiveness on a five-point scale. Conversely, for responses rated 'poor', the LLMs were further prompted to self-correct, and the revised responses were re-evaluated for accuracy.
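The grading workflow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `majority_rating` and `follow_up` are hypothetical, and the handling of a three-way disagreement (returning `None`) is an assumption, since the abstract does not say how such cases were resolved.

```python
from collections import Counter
from typing import Optional, Sequence

# The three-point accuracy scale named in the Methods.
RATINGS = ("poor", "borderline", "good")


def majority_rating(grades: Sequence[str]) -> Optional[str]:
    """Return the rating given by more than half of the graders.

    Returns None when no rating has a strict majority (for example, a
    three-way split). This fallback is an assumption; the abstract does
    not describe how ties were handled.
    """
    if not grades:
        raise ValueError("at least one grade is required")
    for grade in grades:
        if grade not in RATINGS:
            raise ValueError(f"unknown rating: {grade!r}")
    rating, count = Counter(grades).most_common(1)[0]
    return rating if count > len(grades) / 2 else None


def follow_up(final_rating: Optional[str]) -> str:
    """Map a final accuracy rating to the next step named in the Methods."""
    if final_rating == "good":
        return "grade comprehensiveness on a five-point scale"
    if final_rating == "poor":
        return "prompt for self-correction, then re-grade accuracy"
    # No further step is described for 'borderline' or unresolved ratings.
    return "no further step described"


if __name__ == "__main__":
    final = majority_rating(["good", "good", "borderline"])
    print(final, "->", follow_up(final))
```

With three graders, any two matching grades decide the final rating, so an unresolved case can only arise when all three graders disagree.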

Citation impact

  • Total citations: 306
  • FWCI: 11.07
  • Percentile: 100%
  • References: 55

Authors

13

Topics & keywords

Keywords
  • Benchmarking
  • Scale (ratio)
  • Point (geometry)
  • Medicine
  • Test (biology)
  • Family medicine
  • Demography
  • Geography
UN Sustainable Development Goals
  • No poverty

Funding