Understanding overfitting in random forest for probability estimation: a visualization and simulation study
KU Leuven · Nuffield Orthopaedic Centre · +5 more institutions
Abstract
Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing the data space in three real-world case studies and (2) conducting a simulation study.
For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGMs), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with minimum node size 2 or 20 using the ranger R package, resulting in 192 scenarios in total (48 DGMs × 2 training set sizes × 2 minimum node sizes). Model performance was evaluated on large test datasets (N = 100,000).
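The simulation design described above can be illustrated with a minimal sketch in plain Python. It enumerates the 192 scenarios and draws one training dataset from a logistic data-generating mechanism. The coefficient values, the independent standard-normal predictors, and the helper names `simulate_logistic_dataset` and `auc` are illustrative assumptions, not the settings or code used in the study, which relied on the ranger R package. No random forest is fitted here.

```python
import itertools
import math
import random

# Scenario grid from the abstract: 48 DGMs x 2 training sizes x 2 minimum node sizes.
N_DGMS = 48
TRAIN_SIZES = (200, 4000)
MIN_NODE_SIZES = (2, 20)
scenarios = list(itertools.product(range(N_DGMS), TRAIN_SIZES, MIN_NODE_SIZES))


def simulate_logistic_dataset(n, betas, intercept=0.0, seed=None):
    """Draw n rows from a logistic DGM (hypothetical settings).

    Predictors are independent standard normals (an assumption);
    the outcome is Bernoulli with probability 1 / (1 + exp(-lp)).
    Returns a list of (predictors, true_probability, outcome) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in betas]
        lp = intercept + sum(b * xi for b, xi in zip(betas, x))
        p = 1.0 / (1.0 + math.exp(-lp))
        y = 1 if rng.random() < p else 0
        rows.append((x, p, y))
    return rows


def auc(scores, labels):
    """Rank-based AUC: share of (event, non-event) pairs ranked correctly, ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# One small training set; the coefficients are illustrative, with two noise predictors.
data = simulate_logistic_dataset(200, betas=[1.0, 0.5, 0.0, 0.0], seed=1)
true_auc = auc([p for _, p, _ in data], [y for _, _, y in data])
print(len(scenarios))  # 192
```

Scoring the simulated outcomes with the true probabilities, as done for `true_auc`, gives the discrimination that a perfectly specified model would reach on that sample. A fitted model's training AUC can be compared against it to gauge how much of the apparent performance reflects overfitting.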
Citation impact
- FWCI: 34.62
- Percentile: 100%
- References: 41
Authors
- Lasai Barreñada (corresponding author)
KU Leuven
- Paula Dhiman
Nuffield Orthopaedic Centre, University of Oxford
- D. Timmerman
KU Leuven
- Anne-Laure Boulesteix
Zimmer Biomet (Netherlands), Ludwig-Maximilians-Universität München
- Ben Van Calster
Leiden University Medical Center, VIB-KU Leuven Center for Microbiology, KU Leuven
Topics & keywords
- Overfitting
- Random forest
- Estimation
- Visualization
- Computer science
- Statistics
- Artificial intelligence
- Mathematics
- Climate action