On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Keskar, Nitish Shirish; Mudigere, Dheevatsa; Nocedal, Jorge; Smelyanskiy, Mikhail; Tang, Ping

doi:10.48550/arxiv.1609.04836

Preprint · arXiv (Cornell University) · Sep 15, 2016 · Green OA

Intel (United States) · Management Sciences (United States)

Indexed in: arXiv, DataCite

Abstract

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima…
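The batch-based gradient approximation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the scalar model, the squared-error loss, the synthetic data, and the function names `batch_gradient` and `minibatch_gradient` are all assumptions chosen for illustration.

```python
import random

def batch_gradient(w, xs, ys, indices):
    """Average gradient of the squared error (w*x - y)**2 over the given indices."""
    return sum(2.0 * xs[i] * (w * xs[i] - ys[i]) for i in indices) / len(indices)

def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Gradient estimate from batch_size points sampled without replacement."""
    indices = rng.sample(range(len(xs)), batch_size)
    return batch_gradient(w, xs, ys, indices)

rng = random.Random(0)
n = 1024
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # synthetic data, assumed true slope 3.0

w = 0.0
g_full = batch_gradient(w, xs, ys, range(n))      # exact gradient over all data
g_small = minibatch_gradient(w, xs, ys, 32, rng)  # noisy small-batch estimate
g_all = minibatch_gradient(w, xs, ys, n, rng)     # batch covers the whole dataset
```

In this sketch a batch that covers the whole dataset reproduces the full gradient up to floating-point rounding, while a 32-point batch gives only a noisy estimate of it.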

Citation impact

Total citations: 579

FWCI: —
Percentile: —
References: 34

Authors: 5

Topics & keywords

Keywords

Maxima and minima
Generalization
Stochastic gradient descent
Computer science
Deep learning
Batch processing
Noise (video)
Gradient descent

UN Sustainable Development Goals

No poverty
