Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning
Indexed in Crossref
Abstract
In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients at a single server to obtain their average, and updates each worker’s local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up in training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which…
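The parallel mini-batch SGD loop described in the abstract can be sketched as follows. This is only an illustrative toy, not the paper's implementation: it simulates the workers sequentially in one process, fits a scalar least-squares model, and uses hypothetical names (`make_shards`, `local_gradient`, `parallel_minibatch_sgd`) and assumed hyperparameters that do not come from the source.

```python
import random


def make_shards(num_workers=4, per_worker=64, true_w=3.0, seed=1):
    """Build one synthetic data shard per worker for y = true_w * x + noise (assumed toy data)."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(per_worker):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, true_w * x + rng.gauss(0.0, 0.01)))
        shards.append(shard)
    return shards


def local_gradient(w, batch):
    """Mini-batch gradient of the loss 0.5 * (w * x - y)^2, averaged over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def parallel_minibatch_sgd(shards, steps=300, lr=0.5, batch_size=8, seed=0):
    """Simulate parallel mini-batch SGD with one gradient exchange per step."""
    rng = random.Random(seed)
    # All workers start from the same model and apply the same update,
    # so a single variable stands in for every worker's local copy.
    w = 0.0
    for _ in range(steps):
        # Each worker samples a local stochastic gradient from its own shard.
        grads = [local_gradient(w, rng.sample(shard, batch_size)) for shard in shards]
        # The server aggregates the gradients and computes their average.
        avg_grad = sum(grads) / len(grads)
        # Every worker takes an SGD step with the averaged gradient.
        w -= lr * avg_grad
    return w


if __name__ == "__main__":
    estimate = parallel_minibatch_sgd(make_shards())
    print(f"estimated w = {estimate:.4f}")
```

Every step in this sketch requires one round of gradient communication between the workers and the server, which is the cost the abstract identifies as the limit on linear scalability.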
Citation impact
- Total citations: 510
- FWCI: 38.43
- Percentile: 100%
- References: 39
Authors: 3
Topics & keywords
Keywords
- Scalability
- Computer science
- Speedup
- Overhead (engineering)
- Heuristic
- Convergence (economics)
- Artificial intelligence
- Deep learning