VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data
University of Southern California · Fudan University
Abstract
Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results, especially for short contigs that have few predicted proteins or whose proteins lack similarity to previously known viruses.
We have developed VirFinder, the first k-mer-frequency-based machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder's performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating it on sequences obtained after that date.
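To make the idea of a k-mer signature concrete, the following is a minimal Python sketch of how a contig might be turned into a k-mer frequency vector, together with a date-based train/test split like the one described above. It is an illustration under stated assumptions, not VirFinder's actual code: the function names (`canonical`, `kmer_profile`, `split_by_date`), the default `k=4`, the merging of each k-mer with its reverse complement, and the record layout `(sequence, label, date)` are all hypothetical choices made for this example.

```python
from collections import Counter
from datetime import date
from itertools import product

# Translation table used to build reverse complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_profile(seq, k=4):
    """Return normalized canonical k-mer frequencies for one contig.

    Assumption for this sketch: strands are merged by counting each k-mer
    under its canonical form, and windows with ambiguous bases are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other codes
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    # Fixed, sorted feature order so every contig yields the same vector layout.
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {key: (counts[key] / total if total else 0.0) for key in keys}


def split_by_date(records, cutoff=date(2014, 1, 1)):
    """Split (sequence, label, date) records into training and evaluation sets.

    Assumption: records dated exactly on the cutoff go to the evaluation set;
    the source only says "before" and "after" 1 January 2014.
    """
    train = [r for r in records if r[2] < cutoff]
    test = [r for r in records if r[2] >= cutoff]
    return train, test
```

The resulting frequency dictionaries have the same keys for every contig, so they can be converted to fixed-length feature vectors and passed to any standard classifier. Keeping the split strictly chronological means the evaluation set contains only sequences the model could not have seen at training time.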
Citation impact
- FWCI: 38.60
- Percentile: 100%
- References: 88
Authors
- 5
Topics & keywords
- Metagenomics
- Biology
- Computational biology
- Microbial ecology
- Medical microbiology
- Evolutionary biology
- Genetics
- Virology
Funding
- National Science Foundation (Awards: 1136818, DMS-1518001, OCE 1136818)
- Gordon and Betty Moore Foundation (Award: GBMF3779)
- National Institutes of Health (Award: R01GM120624)
- National Institute of General Medical Sciences (Award: R01GM120624)
- Division of Mathematical Sciences (Award: DMS 1518001)
- Division of Ocean Sciences (Award: 1136818)