Article · BMC Bioinformatics · Jan 25, 2007 · Gold OA

Bias in random forest variable importance measures: Illustrations, sources and a solution

Ludwig-Maximilians-Universität München · Technical University of Munich · +3 more institutions

Indexed in: Crossref, DataCite, DOAJ, PubMed

Abstract

Background

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results

Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.
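
The first mechanism named above, biased variable selection in individual trees, can be illustrated with a minimal sketch. It is not the authors' simulation code. It assumes a null setting in which a binary response is independent of five predictors (one continuous, four categorical with 2, 4, 10 and 20 categories), and it uses a plain Gini-gain split search at the root node only. The sample size, number of repetitions, seed, and all function names are hypothetical choices made for illustration.

```python
import random


def gini(n1, n):
    """Gini impurity of a node with n1 class-1 cases out of n."""
    if n == 0:
        return 0.0
    p = n1 / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y, categorical):
    """Largest Gini impurity reduction over all binary splits of one predictor."""
    # Per distinct value: [number of cases, number of class-1 cases].
    groups = {}
    for xi, yi in zip(x, y):
        g = groups.setdefault(xi, [0, 0])
        g[0] += 1
        g[1] += yi
    if categorical:
        # For a binary response, sorting categories by class-1 rate and
        # scanning cut points covers the best subset split.
        ordered = sorted(groups.values(), key=lambda g: g[1] / g[0])
    else:
        ordered = [groups[v] for v in sorted(groups)]
    n, n1 = len(y), sum(y)
    parent = gini(n1, n)
    best, left_n, left_1 = 0.0, 0, 0
    for cnt, pos in ordered[:-1]:  # the last group must stay on the right
        left_n += cnt
        left_1 += pos
        right_n, right_1 = n - left_n, n1 - left_1
        child = (left_n * gini(left_1, left_n)
                 + right_n * gini(right_1, right_n)) / n
        best = max(best, parent - child)
    return best


def root_split_frequencies(n_obs=120, n_rep=300, seed=1):
    """Share of repetitions in which each uninformative predictor wins the root split."""
    rng = random.Random(seed)
    # (label, number of categories); None marks the continuous predictor.
    specs = [("continuous", None), ("2 categories", 2), ("4 categories", 4),
             ("10 categories", 10), ("20 categories", 20)]
    wins = {name: 0 for name, _ in specs}
    for _ in range(n_rep):
        # Response drawn independently of every predictor (null case).
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = {}
        for name, k in specs:
            if k is None:
                x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            else:
                x = [rng.randrange(k) for _ in range(n_obs)]
            gains[name] = best_gini_gain(x, y, categorical=k is not None)
        wins[max(gains, key=gains.get)] += 1
    return {name: w / n_rep for name, w in wins.items()}


freq = root_split_frequencies()
for name, share in freq.items():
    print(f"{name:>14}: {share:.2f}")
```

Because no predictor carries information here, an unbiased selection rule would pick each of the five about 20% of the time. Departures from that share in the printed frequencies reflect only the number of candidate splits each predictor offers. The sketch does not cover the second mechanism, the effect of bootstrap sampling with replacement.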

Citation impact

  • Total citations: 3,583
  • FWCI: 16.82
  • Percentile: 100%
  • References: 42

Authors

4

Topics & keywords

Keywords
  • Random forest
  • Feature selection
  • Variable (mathematics)
  • Selection (genetic algorithm)
  • Scale (ratio)
  • Computer science
  • Random variable
  • Statistics
UN Sustainable Development Goals
  • Life on Land

Funding