Preprint · arXiv (Cornell University) · Aug 8, 2019 · Green OA

On the Variance of the Adaptive Learning Rate and Beyond

Georgia Institute of Technology · Microsoft Research (United Kingdom)

Indexed in: arXiv, DataCite

Abstract

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest that warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image…
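To make the "term to rectify the variance" more concrete, the following is a minimal sketch of the rectification factor as it is commonly stated for RAdam. The abstract does not spell out the formula, so the expressions for `rho_inf`, `rho_t`, and the factor itself are taken from the usual description of the algorithm and should be checked against the full paper. The function name `radam_rectification`, its parameters, and the choice to return `None` for early steps are illustrative assumptions.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Sketch of the RAdam variance rectification factor r_t.

    Returns None for early steps where the variance of the adaptive
    learning rate is treated as intractable (rho_t <= 4). In that case
    the update would typically fall back to an unadapted momentum step.
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)


# The factor is undefined at the very start, then grows toward 1,
# which resembles an automatically derived warmup schedule.
for t in (1, 10, 100, 1000, 10000):
    print(t, radam_rectification(t))
```

Under these assumptions the factor scales the adaptive step size down during the first iterations and approaches 1 as training proceeds, which is consistent with the abstract's reading of warmup as a variance reduction technique.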

Citation impact

Total citations: 608
References: 28

Authors

7

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Variance reduction
  • Robustness (evolution)
  • Implementation
  • Machine learning
  • Variance (accounting)
  • Rate of convergence