SaProt: Protein Language Modeling with Structure-aware Vocabulary
Indexed in Crossref
Abstract
Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We…
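One way to picture a vocabulary that integrates residue tokens with structure tokens is as the set of all (residue, structure) pairs, with one combined token per sequence position. The sketch below is illustrative only: the visible part of the abstract does not say how the two token types are merged, so the pairing scheme, the two 20-letter alphabets, and the names `build_vocab` and `tokenize` are all assumptions. The structure string is taken as given input; no Foldseek call is made.

```python
from itertools import product

# Assumption: the 20 standard amino-acid letters form the residue alphabet.
RESIDUE_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: 20 lowercase letters stand in for discrete structure states.
STRUCTURE_ALPHABET = "acdefghiklmnpqrstvwy"


def build_vocab(residues=RESIDUE_ALPHABET, structures=STRUCTURE_ALPHABET):
    """Map every (residue, structure) pair to an integer id (hypothetical scheme)."""
    tokens = [r + s for r, s in product(residues, structures)]
    return {token: index for index, token in enumerate(tokens)}


def tokenize(residue_seq, structure_seq, vocab):
    """Combine two aligned strings into one id per position."""
    if len(residue_seq) != len(structure_seq):
        raise ValueError("residue and structure sequences must have equal length")
    return [vocab[r + s] for r, s in zip(residue_seq, structure_seq)]


vocab = build_vocab()
ids = tokenize("MKV", "dpv", vocab)
```

Under these assumptions the vocabulary has 20 × 20 = 400 entries, and each position in a protein is represented by a single id that carries both its residue type and its local structure state.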
Citation impact
- Total citations: 227
- FWCI: not available
- Percentile: not available
- References: 45
Authors
6
Topics & keywords
Topics
Keywords
- Computer science
- Vocabulary
- Natural language processing
- Linguistics
- Artificial intelligence
- Philosophy
UN Sustainable Development Goals
- Quality Education