Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features
Johns Hopkins University · Institute for Research in Fundamental Sciences · +1 more institution
Abstract
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust…
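The sparsity problem described in the abstract can be seen with a small counting sketch. The code below is illustrative only and is not the authors' gkm-SVM implementation. It assumes a gapped k-mer is a window of length `l` in which `k` positions are kept and the remaining `l - k` are masked with `-`. The function names and the parameter `l` are hypothetical and do not come from the text above.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k):
    """Count gapped features (assumed definition, see lead-in).

    Each length-l window contributes one feature per choice of k kept
    positions; the other l - k positions are written as '-'.
    """
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in combinations(range(l), k):
            kept = set(keep)
            feature = "".join(
                c if j in kept else "-" for j, c in enumerate(window)
            )
            counts[feature] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTACGTAC"  # toy sequence, not real data
    print(kmer_counts(seq, 2))            # short k: repeated counts
    print(kmer_counts(seq, 8))            # long k: every k-mer seen once
    print(gapped_kmer_counts(seq, 4, 2))  # gapped features share counts
```

On the toy sequence, every contiguous 8-mer occurs exactly once, which is the near-binary behaviour the abstract describes for large k, while masked features such as `A--T` still accumulate counts across windows.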
Citation impact
- FWCI: 13.66
- Percentile: 100%
- References: 39
Authors (4)
Topics & keywords
- Support vector machine
- Computer science
- ENCODE
- k-mer
- Pattern recognition (psychology)
- Bayes' theorem
- Artificial intelligence
- Classifier (UML)