Benchmarking large language models for biomedical natural language processing applications and recommendations
National Institutes of Health · United States National Library of Medicine · +5 more institutions
Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that…
Citation impact
- FWCI: 205.82
- Percentile: 100%
- References: 72
Authors (21)
- Qingyu Chen (corresponding)
National Institutes of Health, United States National Library of Medicine, Yale University
- Yan Hu
The University of Texas Health Science Center, The University of Texas Health Science Center at Houston
- Xueqing Peng
Yale University
- Qianqian Xie
Yale University
- Qiao Jin
National Institutes of Health, United States National Library of Medicine
Topics & keywords
- Computer science
- Biomedical text mining
- Benchmarking
- Data science
- Process (computing)
- Natural language processing
- Artificial intelligence
- Text mining
- Quality Education