Towards Understanding Convergence and Generalization of AdamW

Singapore Management University · Peking University · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not affect the specific optimization steps, and thus differs from the widely used $\ell_{2}$-regularizer, which changes the optimization steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, a theoretical analysis of its convergence behavior and of its generalization improvement over Adam and $\ell_{2}$-regularized Adam ($\ell_{2}$-Adam) is still missing. To address this gap, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_{2}$-Adam. Specifically, AdamW provably converges but…
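The distinction the abstract draws between decoupled weight decay and the $\ell_{2}$-regularizer can be illustrated with a minimal sketch on a single scalar weight. The function name `adam_family_step`, its parameters, and the use of bias correction are assumptions made for illustration; they follow the commonly used form of Adam and AdamW and are not necessarily the exact variant analyzed in the paper.

```python
import math

def adam_family_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.0, decoupled=True):
    """One update on a scalar weight w (hypothetical helper, for illustration).

    decoupled=True  : AdamW-style, decay is applied directly to the weight.
    decoupled=False : l2-Adam, decay is folded into the gradient and so
                      enters the first- and second-order moments.
    """
    if not decoupled:
        # l2-regularizer: the gradient of (weight_decay / 2) * w**2 is added.
        g = g + weight_decay * w
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction (assumed here; t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: shrink the weight without touching m or v.
        w_new -= lr * weight_decay * w
    return w_new, m, v

# Zero loss gradient, so only the weight decay term acts on the weight.
w_adamw, m_a, v_a = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                                     weight_decay=0.1, decoupled=True)
w_l2, m_l, v_l = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                                  weight_decay=0.1, decoupled=False)
```

In this toy setting the AdamW-style step moves the weight from 1.0 to about 0.99 and leaves both moments at zero, while the $\ell_{2}$-Adam step moves it to about 0.9, because the regularization term passes through the adaptive normalization and populates the moments.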

Citation impact

Total citations: 211
FWCI: 66.45
Percentile: 100%
References: 68

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Generalization
  • Artificial intelligence
  • Convergence (economics)
  • Pattern recognition (psychology)
  • Mathematics
UN Sustainable Development Goals
  • Reduced inequalities

Funding