Preprint · Nature Communications · Apr 6, 2025 · Gold OA

Benchmarking large language models for biomedical natural language processing applications and recommendations

National Institutes of Health · United States National Library of Medicine · +5 more institutions

PubMed
Indexed in: Crossref, DOAJ, PubMed

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) can automate this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that…

Citation impact

Total citations: 108
FWCI: 205.82
Percentile: 100%
References: 72

Authors

21

Topics & keywords

Keywords
  • Computer science
  • Biomedical text mining
  • Benchmarking
  • Data science
  • Process (computing)
  • Natural language processing
  • Artificial intelligence
  • Text mining
UN Sustainable Development Goals
  • Quality Education

Funding