Preprint · arXiv (Cornell University) · Sep 15, 2016 · Green OA

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Intel (United States) · Management Sciences (United States)

Indexed in: arXiv · DataCite

Abstract

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima…
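
As an illustration of the small-batch idea the abstract describes (a sampled fraction of the training data is used to approximate the gradient), here is a minimal sketch on a toy one-dimensional least-squares problem. It is not the paper's experimental setup: the data, the function names (`full_gradient`, `minibatch_gradient`, `sgd`), the learning rate, and the step count are all assumptions chosen for illustration, and the example does not attempt to reproduce the generalization gap or the sharp-minima behaviour.

```python
import random

def full_gradient(w, xs, ys):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over the whole dataset."""
    n = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Approximate the gradient from `batch_size` points sampled without replacement."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

def sgd(xs, ys, batch_size, lr=0.5, steps=200, seed=0):
    """Plain SGD on a single weight; hyperparameters are illustrative assumptions."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w -= lr * minibatch_gradient(w, xs, ys, batch_size, rng)
    return w

# Hypothetical toy data: y = 3x with no noise, 1024 points.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1024)]
ys = [3.0 * x for x in xs]

w_small = sgd(xs, ys, batch_size=32)       # small-batch regime
w_large = sgd(xs, ys, batch_size=len(xs))  # batch equals the full dataset
```

With `batch_size` equal to the dataset size, `minibatch_gradient` computes the same quantity as `full_gradient` up to floating-point summation order; smaller batches give a noisier estimate of it.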

Citation impact

Total citations: 579
References: 34

Authors

5 authors

Topics & keywords

Keywords
  • Maxima and minima
  • Generalization
  • Stochastic gradient descent
  • Computer science
  • Deep learning
  • Batch processing
  • Noise (video)
  • Gradient descent
UN Sustainable Development Goals
  • No poverty