Article · Patterns · Aug 4, 2023 · Gold OA

Leakage and the reproducibility crisis in machine-learning-based science

Princeton University · Center for Information Technology

Indexed in: Crossref, DOAJ, PubMed

Abstract

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we…

Citation impact

Total citations: 618
FWCI: 101.08
Percentile: 100%
References: 108

Authors

2

Topics & keywords

Keywords
  • Reproducibility
  • Leakage (economics)
  • Computer science
  • Logistic regression
  • Regression
  • Artificial intelligence
  • Machine learning
  • Data science

Funding