Benchmarking large language models for biomedical natural language processing applications and recommendations

Chen, Qingyu; Hu, Yan; Peng, Xueqing; Xie, Qianqian; Jin, Qiao; Gilson, Aidan; Singer, Maxwell; Ai, X. C.; Lai, Po-Ting; Wang, Zhizheng; Keloth, Vipina K.; Raja, Kalpana; Huang, Jimin; He, Huan; Lin, Fongci; Du, Jingcheng; Zhang, Rui; Zheng, W. Jim; Adelman, Ron A.; Lu, Zhiyong; Xu, Hua

doi:10.1038/s41467-025-56989-2

Preprint · Nature Communications · Apr 6, 2025 · Gold OA

National Institutes of Health · United States National Library of Medicine · +5 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We also examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that…

Citation impact

Total citations: 108

FWCI: 205.82
Percentile: 100%
References: 72

Authors: 21

Topics & keywords

Keywords

Computer science
Biomedical text mining
Benchmarking
Data science
Process (computing)
Natural language processing
Artificial intelligence
Text mining

UN Sustainable Development Goals

Quality Education
