Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
AT&T (United States) · New York University
Abstract
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best…
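The question posed in the abstract (which class proportions to use when only n training examples can be selected) can be made concrete with a small sketch. The code below is an illustration only, not the authors' experimental procedure: the function name `sample_with_class_ratio`, its parameters, and the toy data pool are all hypothetical. It draws a fixed-size training set with a requested minority-class fraction, which is the kind of sample one would then pass to a tree learner to compare class distributions.

```python
import random
from collections import Counter


def sample_with_class_ratio(examples, labels, n, minority_label,
                            minority_fraction, seed=0):
    """Draw n examples without replacement so that round(n * minority_fraction)
    of them carry minority_label. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    # Split the pool's indices by class.
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples of one class for this distribution")
    # Sample each class separately, then shuffle so the classes are interleaved.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


# Toy pool (assumed data): 1000 examples, 10% of them in the minority class.
pool_x = list(range(1000))
pool_y = [1 if i < 100 else 0 for i in range(1000)]

# Same training-set size, three different class distributions.
for fraction in (0.1, 0.3, 0.5):
    _, y = sample_with_class_ratio(pool_x, pool_y, n=100,
                                   minority_label=1, minority_fraction=fraction)
    print(fraction, Counter(y))
```

Holding n fixed while varying only `minority_fraction` isolates the effect of class distribution from the effect of training-set size, which is the comparison the abstract describes.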
Citation impact
- FWCI: 37.48
- Percentile: 100%
- References: 47
Authors (2)
Topics & keywords
- Classifier (UML)
- Computer science
- Machine learning
- Artificial intelligence
- Training set
- Class (philosophy)
- Data mining