Article · Diagnostic and Prognostic Research · Sep 27, 2024 · Gold OA

Understanding overfitting in random forest for probability estimation: a visualization and simulation study

KU Leuven · Nuffield Orthopaedic Centre · +5 more institutions

Indexed in: arXiv · Crossref · DOAJ · PubMed

Abstract

Background

Random forests have become popular for clinical risk prediction modeling. In a case study on predicting ovarian malignancy, we observed training AUCs close to 1. Although this suggests overfitting, performance on test data was competitive. We aimed to understand the behavior of random forests for probability estimation by (1) visualizing data space in three real-world case studies and (2) conducting a simulation study.

Methods

For the case studies, multinomial risk estimates were visualized using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data-generating mechanisms (DGMs), varying the predictor distribution, the number of predictors, the correlation between predictors, the true AUC, and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 with binary outcomes were simulated, and random forest models were trained with a minimum node size of 2 or 20 using the ranger R package, resulting in 192 scenarios in total (48 DGMs × 2 training set sizes × 2 minimum node sizes). Model performance was evaluated on large test datasets (N = 100,000).
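The simulation design described above can be illustrated with a minimal Python sketch. It enumerates the scenario grid, draws one dataset from a logistic DGM, and computes the AUC of the true probabilities. It does not fit a random forest and does not reproduce the study's ranger-based R workflow. The intercept, the coefficients, the independent standard-normal predictors, and the function names `simulate_logistic` and `auc` are hypothetical choices made for illustration; the abstract does not list the actual DGM settings, so the 48 DGMs are represented only by an index.

```python
import itertools
import math
import random

# Scenario grid from the Methods: 48 DGMs x 2 training sizes x 2 minimum node sizes.
N_DGM = 48                   # DGM settings are not listed in the abstract, so they are only indexed
TRAIN_SIZES = (200, 4000)
MIN_NODE_SIZES = (2, 20)
scenarios = list(itertools.product(range(N_DGM), TRAIN_SIZES, MIN_NODE_SIZES))


def simulate_logistic(n, intercept, coefs, rng):
    """Draw n rows from a logistic DGM with independent standard-normal predictors.

    intercept and coefs are hypothetical illustration values, not the paper's.
    Returns (X, y, p), where p holds the true event probabilities.
    """
    X, y, p = [], [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in coefs]
        lp = intercept + sum(b * x for b, x in zip(coefs, row))
        prob = 1.0 / (1.0 + math.exp(-lp))   # inverse logit
        X.append(row)
        p.append(prob)
        y.append(1 if rng.random() < prob else 0)
    return X, y, p


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney statistic), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0       # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


rng = random.Random(2024)
X, y, p_true = simulate_logistic(4000, intercept=-1.0, coefs=[0.8, 0.8, 0.0, 0.0], rng=rng)
print(len(scenarios))                 # number of simulation scenarios
print(round(auc(p_true, y), 3))       # AUC of the true probabilities on this sample
```

In a large sample, the AUC of the true probabilities approximates the "true AUC" that the design varies across DGMs. Comparing a fitted model's training and test AUC against it would be the next step, which this sketch leaves out.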


Funding