RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Li, Wenjun; O’Neill, Kathleen; Haft, Daniel H.; DiCuccio, Michael; Chetvernin, Vyacheslav; Badretdin, Azat; Coulouris, George; Chitsaz, Farideh; Derbyshire, Myra K.; Durkin, A. Scott; Gonzales, Noreen R.; Gwadz, Marc; Lanczycki, Christopher J.; Song, James S.; Thanki, Narmada; Wang, Jiyao; Yamashita, Roxanne A.; Yang, Mingzhang; Zheng, Chanjuan; Marchler‐Bauer, Aron; Thibaud‐Nissen, Françoise

doi:10.1093/nar/gkaa1105

Article · Nucleic Acids Research · Nov 3, 2020 · Gold OA

National Institutes of Health · National Center for Biotechnology Information

Indexed in: Crossref, DOAJ

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM.…

Citation impact

Total citations: 1,097

FWCI: 42.18
Percentile: 100%
References: 27

Authors

21

Topics & keywords

Keywords

RefSeq
Ensembl
UniProt
Annotation
Genome
Biology
Genome project
Computational biology
