Preprint | bioRxiv (Cold Spring Harbor Laboratory) | Oct 2, 2023 | Green OA

SaProt: Protein Language Modeling with Structure-aware Vocabulary

Westlake University

Indexed in Crossref

Abstract

Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We…
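As a rough illustration of what integrating residue tokens with structure tokens could look like, the sketch below pairs each residue letter with a structure letter at the same position and maps the pair to a single token id. This is one plausible reading of the abstract, not the authors' confirmed implementation: the two alphabets, the pairing scheme, and the helper names `build_vocabulary` and `tokenize` are all assumptions made for illustration, and the structure string is hard-coded here rather than produced by Foldseek.

```python
from itertools import product

# Assumed alphabets (not stated in the abstract): the 20 standard amino acids,
# and a 20-letter structural alphabet written in lowercase to keep the two apart.
RESIDUE_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
STRUCTURE_ALPHABET = "acdefghiklmnpqrstvwy"


def build_vocabulary(residues=RESIDUE_ALPHABET, structures=STRUCTURE_ALPHABET):
    """Hypothetical helper: one token per (residue, structure) letter pair."""
    tokens = [r + s for r, s in product(residues, structures)]
    return {token: index for index, token in enumerate(tokens)}


def tokenize(residue_seq, structure_seq, vocab):
    """Hypothetical helper: map two aligned strings to combined token ids."""
    if len(residue_seq) != len(structure_seq):
        raise ValueError("residue and structure sequences must have equal length")
    return [vocab[r + s] for r, s in zip(residue_seq, structure_seq)]


vocab = build_vocabulary()

# Illustrative input only: in practice the structure string would be derived
# from the protein's 3D structure, for example with Foldseek.
ids = tokenize("MKV", "dpv", vocab)
print(len(vocab), ids)
```

With this pairing, the vocabulary size is the product of the two alphabet sizes, so every position carries both sequence and structure information in one token while the model input stays a plain token sequence.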

Citation impact

Total citations: 227
References: 45

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Vocabulary
  • Natural language processing
  • Linguistics
  • Artificial intelligence
  • Philosophy
UN Sustainable Development Goals
  • Quality Education