Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Alibaba Group (United States)

Indexed in Crossref

Abstract

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients at a single server to obtain the average, and updates each worker’s local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up in training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which…
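The parallel mini-batch SGD loop described in the abstract (local gradient sampling, server-side averaging, identical update on every worker) can be sketched as follows. This is a minimal single-process simulation, not the paper's implementation: the scalar least-squares model, the synthetic data, and the names `local_gradient`, `parallel_minibatch_sgd`, and `make_data` are all assumptions made for illustration, and the workers are simulated sequentially rather than run on separate machines.

```python
import random


def local_gradient(w, batch):
    """Gradient of the loss 0.5 * (w * x - y)**2, averaged over one mini-batch.

    The scalar linear model is an assumption chosen to keep the sketch small.
    """
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def parallel_minibatch_sgd(data, num_workers=4, batch_size=8, lr=0.5,
                           steps=200, seed=0):
    """Simulate parallel mini-batch SGD with sequentially simulated workers."""
    rng = random.Random(seed)
    w = 0.0  # every worker holds the same model, so one copy is enough here
    for _ in range(steps):
        # Each worker samples its own mini-batch and computes a local
        # stochastic gradient at the shared model.
        grads = [local_gradient(w, rng.sample(data, batch_size))
                 for _ in range(num_workers)]
        # The server aggregates the gradients and takes their average.
        avg_grad = sum(grads) / num_workers
        # Every worker applies the same SGD update with the averaged gradient.
        w -= lr * avg_grad
    return w


def make_data(n=256, true_w=3.0, seed=1):
    """Hypothetical noiseless data y = true_w * x with x uniform in [-1, 1]."""
    rng = random.Random(seed)
    return [(x, true_w * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]


if __name__ == "__main__":
    print(parallel_minibatch_sgd(make_data()))
```

In this sketch one gradient exchange happens per step, which is the per-iteration communication cost the abstract identifies as the limit on linear scalability.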

Citation impact

  • Total citations: 510
  • FWCI: 38.43
  • Percentile: 100%
  • References: 39

Authors

3

Topics & keywords

Keywords
  • Scalability
  • Computer science
  • Speedup
  • Overhead (engineering)
  • Heuristic
  • Convergence (economics)
  • Artificial intelligence
  • Deep learning