Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

Chelli, Mikaël; Descamps, Jules; Lavoué, Vincent; Trojani, Christophe; Azar, Michel; Deckert, Marcel; Raynier, Jean-Luc; Clowez, Gilles; Boileau, Pascal; Ruetsch-Chelli, Caroline

doi:10.2196/53164

Article · Journal of Medical Internet Research · May 22, 2024 · Gold OA

Assistance Publique – Hôpitaux de Paris · Hôpital Lariboisière · +2 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background: Large language models (LLMs) have raised both interest and concern in the academic community. They offer the potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persists. Objective: The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) in producing references in the context of scientific writing. Methods: The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were…

Citation impact

Total citations: 273

FWCI: 29.11
Percentile: 100%
References: 30

Authors: 10

Topics & keywords

Keywords

Systematic review
Context (archaeology)
Hallucinating
Grey literature
Psychology
Recall
MEDLINE
Computer science

UN Sustainable Development Goals

Quality Education
