Article · Journal of Cheminformatics · Mar 29, 2014 · Gold OA

Cross-validation pitfalls when selecting and assessing regression and classification models

Redx Pharma (United Kingdom) · C4X Discovery (United Kingdom)

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications, including QSAR. In this paper we describe and evaluate best practices that improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing, which enables routine use of previously infeasible approaches.

Methods

We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. For variable selection and parameter tuning, we define two algorithms (repeated grid-search cross-validation and double cross-validation) and provide arguments for using repeated grid-search in the general case.
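The two procedures named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy one-feature ridge regression as the model, squared error as the loss, and the penalty `lam` as the only tuning parameter. All function names (`fit_ridge`, `make_folds`, `cv_loss`, `repeated_grid_search_cv`, `repeated_nested_cv`) and default values are hypothetical and chosen for illustration only.

```python
import random
from statistics import mean

def fit_ridge(xs, ys, lam):
    """Toy model (an assumption): one-feature ridge without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def make_folds(n, v, rng):
    """Randomly partition indices 0..n-1 into v folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]

def cv_loss(xs, ys, lam, folds):
    """Mean squared error of held-out predictions over one V-fold split."""
    sq_errors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        sq_errors += [(ys[i] - w * xs[i]) ** 2 for i in test]
    return mean(sq_errors)

def repeated_grid_search_cv(xs, ys, grid, v=5, repeats=10, seed=0):
    """Average each grid value's CV loss over `repeats` random splits.

    Returns (best_lam, mean_loss_per_lam)."""
    rng = random.Random(seed)
    losses = {lam: [] for lam in grid}
    for _ in range(repeats):
        folds = make_folds(len(xs), v, rng)  # one split shared by all grid values
        for lam in grid:
            losses[lam].append(cv_loss(xs, ys, lam, folds))
    mean_loss = {lam: mean(vals) for lam, vals in losses.items()}
    return min(mean_loss, key=mean_loss.get), mean_loss

def repeated_nested_cv(xs, ys, grid, v=5, inner_repeats=3, outer_repeats=5, seed=0):
    """Return one nested-CV loss per outer repeat.

    Tuning happens only on each outer training part, so the outer test
    fold never influences the choice of lam."""
    rng = random.Random(seed)
    results = []
    for _ in range(outer_repeats):
        sq_errors = []
        for test in make_folds(len(xs), v, rng):
            held_out = set(test)
            train = [i for i in range(len(xs)) if i not in held_out]
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            best_lam, _ = repeated_grid_search_cv(
                tx, ty, grid, v, inner_repeats, seed=rng.randrange(10**9))
            w = fit_ridge(tx, ty, best_lam)
            sq_errors += [(ys[i] - w * xs[i]) ** 2 for i in test]
        results.append(mean(sq_errors))
    return results

# Synthetic data for illustration only (not from the paper).
demo_rng = random.Random(1)
xs = [demo_rng.uniform(-2, 2) for _ in range(40)]
ys = [3 * x + demo_rng.gauss(0, 0.5) for x in xs]
grid = [0.0, 0.1, 1.0, 10.0]

best_lam, mean_loss = repeated_grid_search_cv(xs, ys, grid)
nested_losses = repeated_nested_cv(xs, ys, grid)
print(best_lam, nested_losses)
```

The spread of `nested_losses` across outer repeats gives a rough picture of how much the assessed error depends on the particular random split, which is the variance the abstract refers to.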

Citation impact

  • Total citations: 1,051
  • FWCI: 30.86
  • Percentile: 100%
  • References: 33

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Data mining
  • Regression
  • Cross-validation
  • Model validation
  • Regression analysis
  • Data science
  • Machine learning