Article | Jan 1, 2003 | Closed access

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

University of Alberta

Abstract

This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. However, it is recommended that the least-cost classifier be part of that standard, as it can be better than under-sampling for relatively modest…
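As a rough illustration of the two sampling schemes the abstract names, the sketch below shows random under-sampling and random over-sampling in plain Python. It rests on assumptions that are not in the abstract: the helper names `under_sample` and `over_sample` are hypothetical, both functions balance every class to an equal count, and the sampling is uniformly random. The paper's own sampling ratios and its C4.5 setup are not reproduced here. The resampled data would then be handed to a decision tree learner.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect the examples belonging to each class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def under_sample(rows, labels, seed=0):
    """Hypothetical helper: randomly discard examples so that every class
    is reduced to the size of the rarest class (assumed 1:1 balance)."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    sampled = []
    for label, group in groups.items():
        # Sample without replacement, so no example is repeated.
        sampled.extend((row, label) for row in rng.sample(group, target))
    return sampled


def over_sample(rows, labels, seed=0):
    """Hypothetical helper: randomly duplicate examples so that every class
    grows to the size of the largest class (assumed 1:1 balance)."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    sampled = []
    for label, group in groups.items():
        # Keep every original example, then add random duplicates.
        extras = [rng.choice(group) for _ in range(target - len(group))]
        sampled.extend((row, label) for row in group + extras)
    return sampled


# Toy imbalanced data set: 90 negative examples, 10 positive examples.
rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10

under_counts = Counter(label for _, label in under_sample(rows, labels))
over_counts = Counter(label for _, label in over_sample(rows, labels))
print(under_counts["neg"], under_counts["pos"])  # 10 10
print(over_counts["neg"], over_counts["pos"])    # 90 90
```

Under-sampling shrinks the training set to 20 examples here, while over-sampling grows it to 180 by repeating minority examples. That difference in what the learner sees is the kind of interaction with C4.5 that the paper examines.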

Citation impact

Total citations: 833
FWCI: 18.26
Percentile: 100%
References: 14

Authors: 2

Topics & keywords

Keywords
  • Undersampling
  • Sampling (signal processing)
  • Computer science
  • Decision tree
  • Machine learning
  • Artificial intelligence
  • Sensitivity (control systems)
  • Class (philosophy)