Cross-validation pitfalls when selecting and assessing regression and classification models

Krstajić, Damjan; Buturović, Ljubomir; Leahy, David E.; Thomas, Simon

doi:10.1186/1758-2946-6-10

Article · Journal of Cheminformatics · Mar 29, 2014 · Gold OA

Redx Pharma (United Kingdom) · C4X Discovery (United Kingdom)

Indexed in: Crossref, DOAJ, PubMed

Abstract

Background

We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications, including QSAR. In this paper we describe and evaluate best practices that improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing, which enables routine use of previously infeasible approaches.

Methods

We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. For variable selection and parameter tuning, we define two algorithms (repeated grid-search cross-validation and double cross-validation) and argue for using repeated grid-search in the general case.
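The two procedures named above can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D k-nearest-neighbour regressor, the squared-error loss, the synthetic data, and every function name below are hypothetical stand-ins chosen so the example runs with the Python standard library alone. It also assumes that the tuning step picks the grid value with the lowest loss averaged over repetitions.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D x)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def make_folds(n, v, rng):
    """Randomly partition indices 0..n-1 into v folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]

def fold_errors(data, fold, k):
    """Squared errors on one held-out fold, fitting on all other points."""
    held_out = set(fold)
    train = [p for i, p in enumerate(data) if i not in held_out]
    return [(knn_predict(train, data[i][0], k) - data[i][1]) ** 2 for i in fold]

def cv_loss(data, k, folds):
    """Mean squared error of k-NN over one V-fold split."""
    return mean(e for fold in folds for e in fold_errors(data, fold, k))

def repeated_grid_search_cv(data, grid, v=5, repeats=10, seed=0):
    """Average each grid value's V-fold CV loss over several random splits."""
    rng = random.Random(seed)
    # Every grid value is scored on the same set of splits.
    splits = [make_folds(len(data), v, rng) for _ in range(repeats)]
    avg_loss = {k: mean(cv_loss(data, k, f) for f in splits) for k in grid}
    return min(avg_loss, key=avg_loss.get), avg_loss

def repeated_nested_cv(data, grid, v=5, inner_repeats=3, outer_repeats=5, seed=0):
    """Return one nested-CV loss per outer repetition."""
    rng = random.Random(seed)
    losses = []
    for _ in range(outer_repeats):
        errors = []
        for fold in make_folds(len(data), v, rng):
            held_out = set(fold)
            train = [p for i, p in enumerate(data) if i not in held_out]
            # Tuning only ever sees the outer-training part of the data.
            k, _ = repeated_grid_search_cv(
                train, grid, v, inner_repeats, rng.randrange(2**32))
            errors += fold_errors(data, fold, k)
        losses.append(mean(errors))
    return losses

# Synthetic data (an assumption for illustration): noisy y = x^2.
rng = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.2)) for x in range(40)]
grid = [1, 3, 5, 9, 15]

best_k, avg_loss = repeated_grid_search_cv(data, grid)
nested = repeated_nested_cv(data, grid)
print("selected k:", best_k)
print("nested CV losses:", [round(l, 3) for l in nested])
```

In this sketch the spread of the values in `nested` gives a rough sense of how much a single nested cross-validation estimate can vary with the random split, which is the kind of variability that repetition is meant to expose.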

Citation impact

Total citations: 1,051

FWCI: 30.86
Percentile: 100%
References: 33

Keywords

Computer science
Data mining
Regression
Cross-validation
Model validation
Regression analysis
Data science
Machine learning
