Understanding overfitting in random forest for probability estimation: a visualization and simulation study

Barreñada, Lasai; Dhiman, Paula; Timmerman, D.; Boulesteix, Anne‐Laure; Van Calster, Ben

doi:10.1186/s41512-024-00177-1

Article · Diagnostic and Prognostic Research · Sep 27, 2024 · Gold OA

KU Leuven · Nuffield Orthopaedic Centre · +5 more institutions

Indexed in: arXiv, Crossref, DOAJ, PubMed

Abstract

Background

Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) a simulation study.

Methods

For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGMs), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with a minimum node size of 2 or 20 using the ranger R package, resulting in 192 scenarios in total. Model performance was evaluated on large test datasets (N = 100,000).
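The simulation design described above can be illustrated with a minimal sketch in standard-library Python. It is not the authors' code, which used R and the ranger package. It only enumerates the scenario grid (48 DGMs × 2 training sizes × 2 minimum node sizes) and simulates one logistic DGM to show what a "true AUC" means. The function names, the coefficients, the intercept, the seed, and the use of independent standard-normal predictors are all assumptions made for illustration; the abstract does not state the actual DGM settings.

```python
import itertools
import math
import random


def simulate_logistic_dgm(n, coefs, intercept, rng):
    """Draw n rows from a logistic DGM with independent N(0, 1) predictors.

    Returns (X, p, y): predictors, true event probabilities, binary outcomes.
    The predictor distribution here is an illustrative assumption.
    """
    X, p, y = [], [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in coefs]
        lin = intercept + sum(b * x for b, x in zip(coefs, row))
        prob = 1.0 / (1.0 + math.exp(-lin))
        X.append(row)
        p.append(prob)
        y.append(1 if rng.random() < prob else 0)
    return X, p, y


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formula, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Scenario grid from the abstract: DGM index x training size x minimum node size.
scenarios = list(itertools.product(range(48), (200, 4000), (2, 20)))

# One hypothetical DGM: two true predictors and two noise predictors.
rng = random.Random(2024)
X, p, y = simulate_logistic_dgm(4000, coefs=[0.8, 0.5, 0.0, 0.0],
                                intercept=0.0, rng=rng)

# Discrimination of the true probabilities approximates the DGM's "true AUC".
true_auc = auc(p, y)
print(len(scenarios), round(true_auc, 3))
```

In a full replication, each scenario would be repeated over 1000 simulated training datasets, a probability forest would be fitted to each, and its estimated risks would be scored against a large independent test set in the same way that `auc` scores the true probabilities here.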

Keywords

Overfitting
Random forest
Estimation
Visualization
Computer science
Statistics
Artificial intelligence
Mathematics