Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Tsinghua University · Stanford University
Abstract
Large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure. The situation is even worse for distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find that 99.9% of the gradient exchange in distributed SGD is redundant, and we propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and…
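The abstract does not spell out the mechanism, so the following is only a rough sketch of what treating most of the gradient exchange as redundant could look like: each worker sends only the largest-magnitude entries and keeps the rest in a local residual that accumulates until it is large enough to send. The function name `sparsify`, the `keep_ratio` parameter, and the list-based representation are assumptions made for illustration. This is not the authors' implementation, and it omits the accuracy-preserving methods named above.

```python
def sparsify(grad, residual, keep_ratio=0.001):
    """Return (sparse_update, new_residual) for one worker and one step.

    grad, residual: equal-length lists of floats.
    sparse_update: dict mapping index -> value for the entries to transmit.
    keep_ratio: fraction of entries to send (0.001 keeps 0.1%, an assumption
    chosen to mirror the 99.9% figure quoted in the abstract).
    """
    # Add the fresh gradient to whatever was left unsent in earlier steps.
    acc = [r + g for r, g in zip(residual, grad)]
    # Number of entries to transmit, never fewer than one.
    k = max(1, round(len(acc) * keep_ratio))
    # Indices of the k accumulated entries with the largest magnitude.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sparse_update = {i: acc[i] for i in top}
    # Transmitted entries are cleared locally; the rest keep accumulating.
    new_residual = [0.0 if i in sparse_update else v for i, v in enumerate(acc)]
    return sparse_update, new_residual
```

With 2000 gradient entries and the default `keep_ratio`, each call transmits two index-value pairs instead of 2000 floats, and a small entry that is skipped in one step is carried forward in the residual rather than discarded.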
Citation impact
- FWCI: —
- Percentile: —
- References: 37
Authors
- 5
Topics & keywords
- Computer science
- Scalability
- Bandwidth (computing)
- Deep learning
- Stochastic gradient descent
- Compression ratio
- Artificial intelligence
- Computer engineering
- Industry, innovation and infrastructure