Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Yu, Hao; Yang, Sen; Zhu, Shenghuo

doi:10.1609/aaai.v33i01.33015693

Article | Proceedings of the AAAI Conference on Artificial Intelligence | Jul 17, 2019 | Diamond OA

Alibaba Group (United States)

Indexed in Crossref

Abstract

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients at a single server to obtain their average, and updates each worker’s local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up of the training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which…
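The parallel mini-batch SGD loop described in the abstract can be sketched as follows. This is a minimal single-process simulation under stated assumptions, not the authors' implementation: the scalar linear model, the squared-error loss, the noise-free toy data, and the names `local_gradient` and `parallel_minibatch_sgd` are all hypothetical choices made for illustration, and the "workers" are simulated sequentially rather than run on separate machines.

```python
import random


def local_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def parallel_minibatch_sgd(data, num_workers=4, batch_size=8,
                           lr=0.1, steps=200, seed=0):
    """Simulate parallel mini-batch SGD on a scalar linear model y = w * x.

    Assumption: workers are simulated in one process; a real system would
    communicate gradients over a network at every step.
    """
    rng = random.Random(seed)
    w = 0.0  # every worker holds an identical copy of the model
    for _ in range(steps):
        # Each worker samples its own mini-batch and computes a local gradient.
        grads = [local_gradient(w, rng.sample(data, batch_size))
                 for _ in range(num_workers)]
        # The server aggregates the local gradients into their average.
        avg_grad = sum(grads) / num_workers
        # Every worker applies the same SGD update with the averaged gradient.
        w -= lr * avg_grad
    return w


# Toy data generated from y = 3x without noise (an illustrative assumption).
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-20, 21)]
w_final = parallel_minibatch_sgd(data)
print(w_final)
```

Note that one round of gradient communication happens at every step of the loop, which is the per-iteration cost the abstract identifies as the limit on linear scalability.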

Citation impact

Total citations: 510
FWCI: 38.43
Percentile: 100%
References: 39

Authors: 3

Topics & keywords

Keywords

Scalability
Computer science
Speedup
Overhead (engineering)
Heuristic
Convergence (economics)
Artificial intelligence
Deep learning
