Towards Understanding Convergence and Generalization of AdamW
Singapore Management University · Peking University · +2 more institutions
Abstract
AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not affect the optimization steps themselves, and so differs from the widely used $\ell_2$-regularizer, which changes the optimization steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and $\ell_2$-regularized Adam ($\ell_2$-Adam) have not yet been established. To address this gap, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_2$-Adam. Specifically, AdamW provably converges but…
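The distinction the abstract draws between decoupled weight decay and the $\ell_2$-regularizer can be illustrated with a minimal single-parameter sketch. It assumes the commonly used bias-corrected Adam update and the decoupled-decay form of Loshchilov and Hutter; the paper's exact variant may differ. The function names and default hyperparameters (`lr`, `lam`, `beta1`, `beta2`, `eps`) are illustrative choices, not taken from the paper.

```python
import math


def _adam_direction(m, v, g, t, beta1, beta2, eps):
    """Update Adam's moment estimates with gradient g; return (direction, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g          # first-order moment
    v = beta2 * v + (1.0 - beta2) * g * g      # second-order moment
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def l2_adam_step(w, grad, m, v, t, lr=1e-3, lam=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """l2-Adam: the regularizer's gradient lam * w is folded into the
    gradient, so it changes both moment estimates."""
    g = grad + lam * w
    direction, m, v = _adam_direction(m, v, g, t, beta1, beta2, eps)
    return w - lr * direction, m, v


def adamw_step(w, grad, m, v, t, lr=1e-3, lam=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW (assumed form): moments see only the loss gradient; the decay
    term lam * w is applied to the weight separately."""
    direction, m, v = _adam_direction(m, v, grad, t, beta1, beta2, eps)
    return w - lr * (direction + lam * w), m, v


if __name__ == "__main__":
    # Zero loss gradient isolates the effect of the decay term.
    print("AdamW  :", adamw_step(1.0, 0.0, 0.0, 0.0, t=1))
    print("l2-Adam:", l2_adam_step(1.0, 0.0, 0.0, 0.0, t=1))
```

With a zero loss gradient, the AdamW step in this sketch leaves both moments at zero and only shrinks the weight by `lr * lam * w`. The $\ell_2$-Adam step instead feeds `lam * w` through the moments, and the adaptive normalization turns it into a step of roughly `lr` regardless of how small `lam` is.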
Citation impact
- FWCI: 66.45
- Percentile: 100%
- References: 68
Authors: 4
Topics & keywords
- Computer science
- Generalization
- Artificial intelligence
- Convergence (economics)
- Pattern recognition (psychology)
- Mathematics
- Reduced inequalities