Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

Weiss, Gary M.; Provost, Foster

doi:10.1613/jair.1199

Article · Journal of Artificial Intelligence Research · Oct 1, 2003 · Diamond OA

Gary M. Weiss · Foster Provost

AT&T (United States) · New York University

Indexed in: arXiv, Crossref, DOAJ

Abstract

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best…
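The question posed in the abstract (which class proportion to use when only n training examples can be selected) can be made concrete with a small sampling helper. The following is a minimal sketch, not the authors' procedure: the function name `sample_with_class_distribution`, its parameters, and the toy data pool are all hypothetical. It assumes a two-class setting in which one label is designated the minority class. It draws a training set of fixed size n with a requested minority-class fraction, which could then be passed to any tree learner.

```python
import random


def sample_with_class_distribution(examples, labels, n, minority_fraction,
                                   minority_label, seed=0):
    """Draw n examples so that round(n * minority_fraction) carry minority_label.

    Hypothetical helper for illustration; all names are assumptions.
    """
    rng = random.Random(seed)
    # Split the available pool into indices of minority and majority examples.
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("pool too small for the requested class distribution")
    # Sample without replacement from each class, then shuffle the result.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


# Toy pool (assumed): 200 minority examples (label 1), 800 majority (label 0).
pool_labels = [1] * 200 + [0] * 800
pool_examples = list(range(1000))

# Same training-set size, three different class distributions.
for frac in (0.1, 0.3, 0.5):
    X, y = sample_with_class_distribution(
        pool_examples, pool_labels, n=100,
        minority_fraction=frac, minority_label=1)
    print(frac, sum(y), len(y))
```

Holding n fixed while varying `minority_fraction` isolates the effect of class distribution from the effect of training-set size, which is the comparison the abstract describes.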

Citation impact

Total citations: 926

FWCI: 37.48
Percentile: 100%
References: 47

Authors: 2

Topics & keywords

Keywords

Classifier (UML)
Computer science
Machine learning
Artificial intelligence
Training set
Class (philosophy)
Data mining
