Characterizing and measuring bias in sequence data

Ross, Michael; Russ, Carsten; Costello, Maura; Hollinger, Andrew; Lennon, Niall J.; Hegarty, Ryan; Nusbaum, Chad; Jaffe, David B.

doi:10.1186/gb-2013-14-5-r51

Article; Genome Biology; May 29, 2013; Gold OA

Broad Institute

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias.

Results

We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage.
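The poor-coverage statistic quoted above (the share of positions covered at less than 10% of the genome-wide average) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `fraction_poorly_covered`, the `threshold` parameter and the toy depth values are assumptions made purely for illustration, and a real analysis would read per-base depths from aligned reads rather than a short list.

```python
from statistics import mean


def fraction_poorly_covered(depths, threshold=0.10):
    """Return the fraction of positions whose depth is below
    `threshold` times the mean depth across all positions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    cutoff = threshold * mean(depths)
    low = sum(1 for d in depths if d < cutoff)
    return low / len(depths)


# Toy per-base depths (made-up numbers, not data from the paper).
toy_depths = [120, 118, 125, 3, 130, 0, 122, 119, 121, 124]
print(fraction_poorly_covered(toy_depths))
```

The cutoff is relative to the mean, so the same function applies at any sequencing depth; only the choice of `threshold` encodes what counts as "poor" coverage.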

Citation impact

Total citations: 961

FWCI: 40.53
Percentile: 100%
References: 49

Authors: 8

Topics & keywords

Keywords

Biology
Human genetics
Genome Biology
Computational biology
Evolutionary biology
Sequence (biology)
Computational genomics
Genetics
