An extensive empirical study of feature selection metrics for text classification

Forman, George

Article · Mar 1, 2003 · Closed access

Abstract

Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives—accuracy, F-measure, precision, and recall—since each is appropriate in different situations. The results reveal that a new feature selection metric we call ‘Bi-Normal Separation…

Citation impact

Total citations: 2,390

FWCI: 49.63
Percentile: 100%
References: 15

Authors

George Forman (corresponding author)
Hewlett-Packard (United States)

Topics & keywords

Keywords

Computer science
Feature selection
Artificial intelligence
Margin (machine learning)
Machine learning
Benchmark (surveying)
Feature (linguistics)
Metric (unit)
