An extensive empirical study of feature selection metrics for text classification
Hewlett-Packard (United States)
Abstract
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives—accuracy, F-measure, precision, and recall—since each is appropriate in different situations. The results reveal that a new feature selection metric we call ‘Bi-Normal Separation…
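As a rough illustration of the quantities the abstract names, the sketch below scores one word feature with Information Gain from a 2x2 contingency table and computes precision, recall, and F-measure from classifier counts. The function names and the count-based interface are assumptions made for this example, not the paper's code, and the F-measure shown is the balanced F1 variant, which is also an assumption.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a class split given raw counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(tp, fp, fn, tn):
    """Information Gain of a binary word feature.

    tp: positive documents containing the word
    fp: negative documents containing the word
    fn: positive documents lacking the word
    tn: negative documents lacking the word
    """
    n = tp + fp + fn + tn
    if n == 0:
        return 0.0
    with_word = tp + fp
    without_word = fn + tn
    # Class entropy before the split, minus the weighted entropy after it.
    h_class = entropy(tp + fn, fp + tn)
    h_cond = 0.0
    if with_word:
        h_cond += with_word / n * entropy(tp, fp)
    if without_word:
        h_cond += without_word / n * entropy(fn, tn)
    return h_class - h_cond


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and balanced F-measure from classifier counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A word that appears in every positive document and no negative one scores the full class entropy, while a word spread evenly across both classes scores zero. Ranking words by this score and keeping the top few is one common way to apply such a metric.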
Citation impact
- FWCI: 49.63
- Percentile: 100%
- References: 15
Authors
- 1
Topics & keywords
- Computer science
- Feature selection
- Artificial intelligence
- Margin (machine learning)
- Machine learning
- Benchmark (surveying)
- Feature (linguistics)
- Metric (unit)