Article · Jan 1, 2003 · Closed access
C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling
Abstract
This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. But it is recommended that the least cost classifier be part of that standard, as it can be better than under-sampling for relatively modest…
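As a rough illustration of the two sampling schemes the abstract names, the sketch below implements random under-sampling and random over-sampling on a toy imbalanced dataset. It is not the paper's experimental code: the function names `under_sample` and `over_sample`, the toy data, and the choice to balance classes to exactly equal sizes are all assumptions made for illustration. Sampling can also target other class ratios to reflect misclassification costs.

```python
import random
from collections import Counter


def _group_by_class(examples, labels):
    """Collect examples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def under_sample(examples, labels, seed=0):
    """Hypothetical helper: randomly discard examples (without replacement)
    until every class has as many examples as the rarest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(out)
    return out


def over_sample(examples, labels, seed=0):
    """Hypothetical helper: keep every example and randomly duplicate
    examples until every class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in xs)
        out.extend((rng.choice(xs), y) for _ in range(target - len(xs)))
    rng.shuffle(out)
    return out


# Toy imbalanced data (assumed for illustration): 90 negatives, 10 positives.
X = list(range(100))
y = ["neg"] * 90 + ["pos"] * 10

under_counts = Counter(label for _, label in under_sample(X, y))
over_counts = Counter(label for _, label in over_sample(X, y))
print("under-sampled:", dict(under_counts))
print("over-sampled:", dict(over_counts))
```

Under-sampling shrinks the training set to the size dictated by the rare class, while over-sampling grows it by repeating rare-class examples. Either resampled set would then be passed to a decision tree learner such as C4.5.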
Citation impact
- Total citations: 833
- FWCI: 18.26
- Percentile: 100%
- References: 14
Authors: 2
Topics & keywords
Keywords
- Undersampling
- Sampling (signal processing)
- Computer science
- Decision tree
- Machine learning
- Artificial intelligence
- Sensitivity (control systems)
- Class (philosophy)