UniRef: comprehensive and non-redundant UniProt reference clusters
Georgetown University · Georgetown University Medical Center
Abstract
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.
RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and…
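To illustrate the idea of combining identical sequences into a single entry, here is a minimal Python sketch. It is not the UniRef pipeline: the function name `cluster_identical`, the record IDs, and the toy sequences are all hypothetical, and it covers only exact-match grouping, not the similarity-based clustering at other resolutions that the abstract mentions.

```python
from collections import defaultdict


def cluster_identical(records):
    """Group record IDs whose sequences are exactly identical.

    `records` is an iterable of (record_id, sequence) pairs.
    Returns a list of clusters (lists of record IDs) in first-seen order.
    """
    clusters = defaultdict(list)
    for record_id, sequence in records:
        # Normalize case and surrounding whitespace so that trivially
        # different spellings of the same sequence share one key.
        key = sequence.strip().upper()
        clusters[key].append(record_id)
    return list(clusters.values())


# Hypothetical toy records; IDs and sequences are made up for illustration.
records = [
    ("A1", "MKTAYIAK"),
    ("B2", "MKTAYIAK"),
    ("C3", "MSLLTEVE"),
]
clusters = cluster_identical(records)
```

Keying a dictionary on the sequence itself keeps the grouping linear in the number of records; for very large datasets one would likely key on a hash of the sequence instead, which is a design choice this sketch leaves out.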
Citation impact
- FWCI: 13.45
- Percentile: 100%
- References: 53
Authors
- Barış Ethem Süzek (corresponding author)
Georgetown University, Georgetown University Medical Center
- Hongzhan Huang
Georgetown University, Georgetown University Medical Center
- Peter B. McGarvey
Georgetown University, Georgetown University Medical Center
- Raja Mazumder
Georgetown University, Georgetown University Medical Center
- Cathy Wu
Georgetown University, Georgetown University Medical Center
Topics & keywords
- UniProt
- Cluster analysis
- RefSeq
- Computer science
- Annotation
- Sequence (biology)
- Sequence database
- Similarity (geometry)