Article · Sep 14, 2014 · Closed access

1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

Indexed in Crossref

Abstract

We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k…
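The error-feedback idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `one_bit_quantize`, the threshold of 0, and the use of per-bin means as the two reconstruction values are assumptions made for illustration. The abstract itself only states that values are quantized to one bit and that the quantization error is carried forward across minibatches.

```python
from typing import List, Tuple


def one_bit_quantize(
    grad: List[float], residual: List[float]
) -> Tuple[List[bool], List[float], List[float]]:
    """Quantize a gradient vector to one bit per value with error feedback.

    Hypothetical helper: the name, threshold, and reconstruction rule are
    assumptions of this sketch, not taken from the source.
    """
    # Add the quantization error carried over from the previous minibatch.
    corrected = [g + r for g, r in zip(grad, residual)]

    # One bit per value: True if the corrected value is non-negative
    # (threshold 0 is assumed here).
    bits = [c >= 0.0 for c in corrected]

    # Two reconstruction levels: the mean of the values in each bin (assumed).
    pos = [c for c, b in zip(corrected, bits) if b]
    neg = [c for c, b in zip(corrected, bits) if not b]
    pos_level = sum(pos) / len(pos) if pos else 0.0
    neg_level = sum(neg) / len(neg) if neg else 0.0

    # What a receiver would decode from the bits plus the two levels.
    decoded = [pos_level if b else neg_level for b in bits]

    # Quantization error to carry forward into the next minibatch.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, decoded, new_residual


# Usage: the residual starts at zero and is threaded through successive
# minibatch gradients (the gradient values here are made up).
residual = [0.0] * 4
for grad in ([0.5, -0.25, 1.5, -0.75], [0.25, 0.25, -1.0, 0.5]):
    bits, decoded, residual = one_bit_quantize(grad, residual)
```

In this sketch only `bits` and the two reconstruction levels would need to be exchanged between workers. The residual stays local, so information dropped in one minibatch is added back into the next one rather than lost.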

Citation impact

Total citations: 888
FWCI: 36.36
Percentile: 100%
References: 22
Authors

5

Topics & keywords

Keywords
  • Stochastic gradient descent
  • Computer science
  • Training (meteorology)
  • Gradient descent
  • Bit (key)
  • Speech recognition
  • Artificial intelligence
  • Computer network