Clustering predicted structures at the scale of the known protein universe
European Bioinformatics Institute · Seoul National University · +5 more institutions
Abstract
Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy [1], and over 214 million predicted structures are available in the AlphaFold database [2]. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations representing…
Citation impact
- FWCI: 51.64
- Percentile: 100%
- References: 65
Authors
- 10
Topics & keywords
- Cluster analysis
- Structural similarity
- Computational biology
- Similarity (geometry)
- Structural Classification of Proteins database
- Protein structure database
- Protein domain
- Computer science
- Life on Land
Funding
- National Research Foundation (Award: 2021-R1C1-C102065)
- Seoul National University
- Eidgenössische Technische Hochschule Zürich
- National Research Foundation of Korea (Awards: 2021-R1C1-C102065, 2021-M3A9-I4021220, RS-2023-00250470, 2020M3-A9G7-103933)
- ETH Zürich Foundation
- Helmut Horten Stiftung
- Samsung