Structured information extraction from scientific text with large language models
Lawrence Berkeley National Laboratory · University of California, Berkeley
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as…
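To make the abstract's idea concrete, here is a minimal sketch of how a passage could be paired with a structured target for fine-tuning, and how a model's output could be parsed back into records. Everything specific in it is an assumption made for illustration: the field names `host` and `dopants`, the separator and stop strings, the helper names, and the example sentence are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical delimiters marking the end of the prompt and of the completion.
PROMPT_END = "\n\n###\n\n"
STOP = "\n\nEND\n\n"


def make_training_example(passage: str, records: list) -> dict:
    """Pair a passage with its target JSON completion (one fine-tuning example)."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records) + STOP,
    }


def parse_completion(completion: str) -> list:
    """Recover a list of record dicts from model output; return [] if unparseable."""
    # Keep only the text before the stop string (the whole string if it is absent).
    text = completion.partition(STOP)[0].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):
        return [parsed]
    if isinstance(parsed, list):
        return [r for r in parsed if isinstance(r, dict)]
    return []


# Made-up sentence and annotation for the dopant/host linking task.
passage = "ZnO nanowires were doped with Al to improve conductivity."
records = [{"host": "ZnO", "dopants": ["Al"]}]

example = make_training_example(passage, records)
roundtrip = parse_completion(example["completion"])
```

Emitting entities and their links together in one JSON object is what makes the extraction joint: the relation (which dopant belongs to which host) is encoded by the nesting rather than predicted in a separate step.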
Citation impact
- FWCI: 59.03
- Percentile: 100%
- References: 68
Authors (8)
- John Dagdelen (corresponding author)
  Lawrence Berkeley National Laboratory, University of California, Berkeley
- Alexander Dunn
  Lawrence Berkeley National Laboratory, University of California, Berkeley
- Sang‐Hoon Lee
  Lawrence Berkeley National Laboratory, University of California, Berkeley
- Nicholas Walker
  Lawrence Berkeley National Laboratory
- Andrew Rosen
  Lawrence Berkeley National Laboratory, University of California, Berkeley
Topics & keywords
- Computer science
- Relationship extraction
- Information extraction
- Task (project management)
- Information retrieval
- JSON
- Natural language processing
- Simple (philosophy)
- Quality Education
Funding
- U.S. Department of Energy. Awards: -AC02-05CH11231, BES-ERCAP0024004, 05CH11231, AC02-05CH11231, DE-AC02, DE-AC02-05CH11231, DE-AC02-
- Adolph C. and Mary Sprague Miller Institute for Basic Research in Science, University of California Berkeley
- Toyota Research Institute
- National Energy Research Scientific Computing Center. Awards: 05CH11231, AC02-05CH11231, BES-ERCAP0024004
- Office of Science. Awards: AC02-05CH11231, -AC02-05CH11231, DE-AC02
- Basic Energy Sciences. Awards: DE-AC02, AC02-05CH11231, KCD2S2, DE-AC02-05CH11231, -AC02-05CH11231
- Lawrence Berkeley National Laboratory. Awards: DE-AC02-05CH11231, 05CH11231, AC02-05CH11231