Article | Sep 14, 2014 | Closed access
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs
Indexed in Crossref
Abstract
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k…
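The error-feedback idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `one_bit_quantize` and `one_bit_decode`, the zero threshold, and the choice of reconstruction values (the mean of each sign group over the whole tensor) are assumptions made for illustration. The abstract states only that gradients are quantized to one bit per value and that the quantization error is carried forward across minibatches.

```python
import numpy as np


def one_bit_quantize(grad, residual):
    """Quantize (grad + residual) to one bit per value.

    Returns the bit mask, the two reconstruction values, and the new
    residual (the quantization error to carry into the next minibatch).
    """
    corrected = grad + residual          # add the error left over from the last step
    bits = corrected >= 0.0              # one bit per value (assumed threshold: 0)
    # Assumed reconstruction values: mean of each sign group, 0.0 if a group is empty.
    hi = corrected[bits].mean() if bits.any() else 0.0
    lo = corrected[~bits].mean() if (~bits).any() else 0.0
    decoded = np.where(bits, hi, lo)
    new_residual = corrected - decoded   # error feedback term
    return bits, lo, hi, new_residual


def one_bit_decode(bits, lo, hi):
    """Reconstruct an approximate gradient from the bit mask and two scalars."""
    return np.where(bits, hi, lo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residual = np.zeros(6)
    for step in range(3):
        grad = rng.normal(size=6)        # stand-in for a minibatch gradient
        bits, lo, hi, residual = one_bit_quantize(grad, residual)
        approx = one_bit_decode(bits, lo, hi)
        print(step, approx, residual)
```

Because each step's residual is added to the next gradient before quantization, the decoded gradients summed over all steps differ from the true gradient sum only by the final residual. In a data-parallel setting, only the bit mask and the two scalars would need to be exchanged per tensor.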
Citation impact
- Total citations: 888
- FWCI: 36.36
- Percentile: 100%
- References: 22
Authors: 5
Topics & keywords
Keywords
- Stochastic gradient descent
- Computer science
- Training (meteorology)
- Gradient descent
- Bit (key)
- Speech recognition
- Artificial intelligence
- Computer network