Genome-wide prediction of disease variant effects with a deep protein language model
University of California, San Francisco · Gladstone Institutes · +3 more institutions
Abstract
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in…
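The abstract as shown here does not say how a protein language model's output becomes a variant effect score. A common choice for models of this kind, assumed here for illustration only, is a log-likelihood ratio (LLR) between the variant and wild-type residues at the mutated position. The minimal sketch below scores every single-residue substitution of a sequence from per-position logits. The names `missense_llr_scores`, `toy_sequence`, and `toy_logits` are hypothetical, and the logits are made-up numbers rather than ESM1b output.

```python
import math

# The 20 standard amino acids; the column order of the logits is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Convert raw logits into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def missense_llr_scores(sequence, position_logits):
    """Score every single-residue substitution of `sequence`.

    `position_logits[i]` holds 20 logits for position i, ordered as AMINO_ACIDS.
    Returns {(1-based position, wild-type, mutant): LLR}, where
    LLR = log P(mutant) - log P(wild-type). More negative values mean the
    model finds the substitution less plausible than the wild-type residue.
    """
    if len(sequence) != len(position_logits):
        raise ValueError("need one logit vector per residue")
    scores = {}
    for i, (wt, logits) in enumerate(zip(sequence, position_logits)):
        log_probs = log_softmax(logits)
        wt_log_prob = log_probs[AMINO_ACIDS.index(wt)]
        for j, mut in enumerate(AMINO_ACIDS):
            if mut == wt:
                continue  # skip the wild-type residue itself
            scores[(i + 1, wt, mut)] = log_probs[j] - wt_log_prob
    return scores


# Toy input (hypothetical): a two-residue sequence with flat logits,
# except that the model strongly prefers M at position 1.
toy_sequence = "MK"
toy_logits = [[0.0] * 20 for _ in toy_sequence]
toy_logits[0][AMINO_ACIDS.index("M")] = 3.0

toy_scores = missense_llr_scores(toy_sequence, toy_logits)
```

In this toy input each position yields 19 scores. Substitutions at position 1 score about -3 because the model favors the wild-type M there, while substitutions at position 2 score 0 because the logits are flat. In a real pipeline the logits would come from the language model itself.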
Citation impact
- FWCI: 128.44
- Percentile: 100%
- References: 74
Authors
- Nadav Brandes (corresponding author)
University of California, San Francisco
- Grant Goldman
University of California, San Francisco
- Charlotte H. Wang
University of California, San Francisco
- Chun Ye
Gladstone Institutes, University of California, San Francisco, Parker Institute for Cancer Immunotherapy, Institute of Human Genetics, University of San Francisco
- Vasilis Ntranos
University of California, San Francisco
Topics & keywords
- Missense mutation
- Computational biology
- Indel
- Workflow
- Biology
- Coding (social sciences)
- Genome
- Computer science