Petabase-scale sequence alignment catalyses viral discovery

Edgar, R. C.; Taylor, Brie; Lin, Victor S.-Y.; Altman, Tomer; Barbera, Pierre; Meleshko, Dmitry; Lohr, Dan; Novakovsky, Gherman; Buchfink, Benjamin; Al-Shayeb, Basem; Banfield, Jillian F.; Peña, Marcos de la; Korobeynikov, Anton; Chikhi, Rayan; Babaian, Artem

doi:10.1038/s41586-021-04332-2

Article · Nature · Jan 26, 2022 · Hybrid OA

Verisk Analytics (United States) · Heidelberg Institute for Theoretical Studies · +12 more institutions

Indexed in: Crossref, PubMed

Abstract

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially¹. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10⁵ novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis…

Citation impact

Total citations: 521

FWCI: 80.81
Percentile: 100%
References: 109

Authors: 15

Topics & keywords

Keywords

Computational biology
Biology
Sequence (biology)
RNA
RNA polymerase
RNA virus
Polymerase
Gene

UN Sustainable Development Goals

Industry, innovation and infrastructure
