CD-HIT: accelerated for clustering the next-generation sequencing data

Fu, LiMin; Niu, Beifang; Zhu, Zhengwei; Wu, Sitao; Li, Weizhong

doi:10.1093/bioinformatics/bts565

Article | Bioinformatics | Oct 11, 2012 | Hybrid OA

University of California San Diego

Indexed in: Crossref, DOAJ, PubMed

Abstract

SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. AVAILABILITY:…

Citation impact

Total citations: 11,679

FWCI: 49.30
Percentile: 100%
References: 11

Authors: 5

Topics & keywords

Keywords

Speedup
Computer science
Cluster analysis
Redundancy (engineering)
Data mining
Sequence (biology)
Parallel computing
Machine learning
