Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

Khraisha, Qusai; Put, S. Van; Kappenberg, Johanna; Warraitch, Azza; Hadfield, Kristin

doi:10.1002/jrsm.1715

Article · Research Synthesis Methods · Mar 14, 2024 · Hybrid OA

Trinity College Dublin · University of York

Indexed in: Crossref, PubMed

Abstract

Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a "human-out-of-the-loop" approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset…

Citation impact

Total citations: 187

FWCI: 19.78
Percentile: 100%
References: 43

Authors: 5

Topics & keywords

Keywords

Systematic review
Computer science
Grey literature
Data extraction
Inclusion and exclusion criteria
Meta-analysis
Inclusion (mineral)
Machine learning

UN Sustainable Development Goals

Reduced inequalities
