Petabase-scale sequence alignment catalyses viral discovery
Verisk Analytics (United States) · Heidelberg Institute for Theoretical Studies · +12 more institutions
Abstract
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially¹. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10⁵ novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis…
Citation impact
- FWCI: 80.81
- Percentile: 100%
- References: 109
Authors
15
Topics & keywords
- Computational biology
- Biology
- Sequence (biology)
- RNA
- RNA polymerase
- RNA virus
- Polymerase
- Gene
- Industry, innovation and infrastructure
Funding
- Diamond V
- Agence Nationale de la Recherche (Awards: ANR-18-CE45-0020, ANR-19-P3IA-0001)
- Ministerio de Economía y Competitividad (Award: PID2020-116008GB-I00)
- Saint Petersburg State University
- University of British Columbia
- Russian Science Foundation (Award: 19-14-00172)
- Klaus Tschira Stiftung