Cross-validation pitfalls when selecting and assessing regression and classification models
Redx Pharma (United Kingdom) · C4X Discovery (United Kingdom)
Abstract
We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications, including quantitative structure-activity relationship (QSAR) modelling. In this paper we describe and evaluate best practices that improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing, which enables routine use of previously infeasible approaches.
We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. For variable selection and parameter tuning, we define two algorithms (repeated grid-search cross-validation and double cross-validation) and provide arguments for using repeated grid-search in the general case.
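The two procedures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-parameter ridge model, the toy data generator, and every function name below (`make_toy_data`, `fit_ridge_1d`, `repeated_grid_search_cv`, `repeated_nested_cv`) are assumptions made for illustration only. The sketch assumes that repeated grid-search selects the parameter with the lowest cross-validated error averaged over repeated random V-fold splits, and that repeated nested cross-validation reruns that selection inside each outer training set to obtain one error estimate per outer repeat.

```python
import random
from statistics import mean


def make_toy_data(n, seed):
    """Hypothetical toy data: y = 2x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def v_fold_indices(n, v, rng):
    """Randomly partition the indices 0..n-1 into v folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]


def fit_ridge_1d(xs, ys, lam):
    """Slope-only ridge regression through the origin (stand-in model)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, folds):
    """Mean squared error over all held-out points for one V-fold split."""
    sq_errs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        sq_errs.extend((ys[i] - w * xs[i]) ** 2 for i in held_out)
    return mean(sq_errs)


def repeated_grid_search_cv(xs, ys, grid, v=5, repeats=10, seed=0):
    """Pick the grid value with the lowest CV error averaged over repeats."""
    rng = random.Random(seed)
    errors = {lam: [] for lam in grid}
    for _ in range(repeats):
        # One fresh random split per repeat, shared by every grid value.
        folds = v_fold_indices(len(xs), v, rng)
        for lam in grid:
            errors[lam].append(cv_error(xs, ys, lam, folds))
    mean_err = {lam: mean(e) for lam, e in errors.items()}
    best = min(mean_err, key=mean_err.get)
    return best, mean_err


def repeated_nested_cv(xs, ys, grid, v_outer=5, v_inner=5,
                       outer_repeats=5, inner_repeats=5, seed=0):
    """Return one outer-loop error estimate per outer repeat."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(outer_repeats):
        sq_errs = []
        for held_out in v_fold_indices(len(xs), v_outer, rng):
            held = set(held_out)
            train = [i for i in range(len(xs)) if i not in held]
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            # Tuning sees only the outer training data.
            best, _ = repeated_grid_search_cv(
                tx, ty, grid, v_inner, inner_repeats,
                seed=rng.randrange(2 ** 32))
            w = fit_ridge_1d(tx, ty, best)
            sq_errs.extend((ys[i] - w * xs[i]) ** 2 for i in held_out)
        estimates.append(mean(sq_errs))
    return estimates
```

Calling `repeated_grid_search_cv(xs, ys, grid)` returns the selected parameter together with the mean error per grid value, while `repeated_nested_cv(xs, ys, grid)` returns a list of estimates whose spread indicates how much the assessment depends on the random split. The nested loop multiplies the number of model fits, which is the kind of cost the abstract associates with the need for large-scale computing.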
Citation impact
- FWCI: 30.86
- Percentile: 100%
- References: 33
Authors: 4
Topics & keywords
- Computer science
- Data mining
- Regression
- Cross-validation
- Model validation
- Regression analysis
- Data science
- Machine learning