On the Variance of the Adaptive Learning Rate and Beyond

Liu, Liyuan; Jiang, Haoming; He, Pengcheng; Chen, Weizhu; Liu, Xiaodong; Gao, Jianfeng; Han, Jiawei

doi:10.48550/arxiv.1908.03265

Preprint · arXiv (Cornell University) · Aug 8, 2019 · Green OA

Georgia Institute of Technology · Microsoft Research (United Kingdom)

Indexed in: arXiv, DataCite

Abstract

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem with the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest that warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image…
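The rectification term mentioned in the abstract can be illustrated with a minimal Python sketch. The truncated abstract does not state the formula, so the expression below is taken from the commonly cited form in the full RAdam paper and should be read as an assumption; the function name `rectification_term` and the default `beta2=0.999` are illustrative choices, not part of this record.

```python
import math


def rectification_term(step, beta2=0.999):
    """Sketch of the RAdam variance rectification factor r_t.

    Assumed formula (from the full paper, not from the abstract above):
      rho_inf = 2 / (1 - beta2) - 1
      rho_t   = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
      r_t     = sqrt(((rho_t - 4)(rho_t - 2) rho_inf)
                     / ((rho_inf - 4)(rho_inf - 2) rho_t))   if rho_t > 4
    Returns None when rho_t <= 4, i.e. when the variance of the adaptive
    learning rate is treated as intractable and the adaptive term is skipped.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


if __name__ == "__main__":
    for t in (1, 10, 100, 1000, 100000):
        print(t, rectification_term(t))
```

Under these assumptions the factor is undefined for the very first steps, stays well below 1 early in training, and approaches 1 as the step count grows. That is how the term damps the adaptive learning rate while its variance is large, which is the role the abstract attributes to warmup.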

Citation impact

Total citations: 608

FWCI: —
Percentile: —
References: 28

Authors: 7

Topics & keywords

Keywords

Computer science
Artificial intelligence
Variance reduction
Robustness (evolution)
Implementation
Machine learning
Variance (accounting)
Rate of convergence
