1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

Seide, Frank; Fu, Hao; Droppo, Jasha; Li, Gang; Yu, Dong

doi:10.21437/interspeech.2014-274

Article, Sep 14, 2014, Closed access

Indexed in Crossref

Abstract

We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain. For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k…
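The error-feedback idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The function names, the zero threshold, and the choice of reconstruction values (the mean of the non-negative values and the mean of the negative values) are assumptions made for this example. Each call adds the residual left over from the previous minibatch to the new gradient, sends one bit per value plus two reconstruction values, and keeps whatever the quantizer lost as the next residual.

```python
def quantize_1bit(grad, residual):
    """Quantize (grad + residual) to one bit per value.

    Returns (bits, lo, hi, new_residual). The threshold of 0 and the
    mean-based reconstruction values lo/hi are assumptions of this sketch.
    """
    # Error feedback: add the quantization error carried over from the last step.
    corrected = [g + r for g, r in zip(grad, residual)]
    bits = [1 if v >= 0.0 else 0 for v in corrected]

    # Two reconstruction values: the mean of each group of values.
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    hi = sum(pos) / len(pos) if pos else 0.0
    lo = sum(neg) / len(neg) if neg else 0.0

    # What the receiver will reconstruct, and the error to carry forward.
    decoded = dequantize_1bit(bits, lo, hi)
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, lo, hi, new_residual


def dequantize_1bit(bits, lo, hi):
    """Map each bit back to its reconstruction value."""
    return [hi if b else lo for b in bits]


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.5, -0.1, 0.3, -0.7], [0.2, 0.2, -0.4, 0.1]):
        bits, lo, hi, residual = quantize_1bit(grad, residual)
        print(bits, dequantize_1bit(bits, lo, hi), residual)
```

Because the residual is added back before the next quantization, the sum of the decoded gradients plus the final residual equals the sum of the true gradients, so no gradient mass is permanently dropped in this sketch.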

Citation impact

Total citations: 888

FWCI: 36.36
Percentile: 100%
References: 22

Authors: 5

Topics & keywords

Keywords

Stochastic gradient descent
Computer science
Training (meteorology)
Gradient descent
Bit (key)
Speech recognition
Artificial intelligence
Computer network
