ESE
Stanford University · Tsinghua University · +1 more institution
Abstract
Long Short-Term Memory (LSTM) is widely used in speech recognition. To achieve higher prediction accuracy, machine learning scientists have built increasingly large models. Such large models are both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) for a data center. To speed up prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that…
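The abstract names load-balance-aware pruning but does not say how balance is enforced. The sketch below shows one plausible reading, not the authors' implementation: rows of a weight matrix are assigned to processing elements (PEs) in an interleaved fashion, and each PE's share is pruned by magnitude to the same number of surviving weights, so no PE has more work than another. The function name load_balanced_prune, the row-to-PE mapping, and the parameters num_pes and keep_ratio are all hypothetical; the 0.1 keep ratio only mirrors the 10x pruning figure quoted above.

```python
import random


def load_balanced_prune(weights, num_pes, keep_ratio):
    """Magnitude-prune so every PE keeps the same number of weights.

    Assumption (not from the source): row r is handled by PE (r mod num_pes).
    """
    pruned = [row[:] for row in weights]
    nonzeros_per_pe = []
    for pe in range(num_pes):
        # Coordinates of every weight owned by this PE.
        coords = [(r, c)
                  for r in range(pe, len(weights), num_pes)
                  for c in range(len(weights[r]))]
        keep = round(len(coords) * keep_ratio)
        # Rank this PE's weights by magnitude, largest first.
        coords.sort(key=lambda rc: abs(weights[rc[0]][rc[1]]), reverse=True)
        # Zero everything outside this PE's top-`keep` weights.
        for r, c in coords[keep:]:
            pruned[r][c] = 0.0
        nonzeros_per_pe.append(
            sum(1 for r, c in coords[:keep] if pruned[r][c] != 0.0))
    return pruned, nonzeros_per_pe


# Toy 8x20 matrix of random weights, split across 4 hypothetical PEs.
random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(8)]
pruned, nonzeros_per_pe = load_balanced_prune(W, num_pes=4, keep_ratio=0.1)
print(nonzeros_per_pe)  # every PE ends up with the same nonzero count
```

Pruning the whole matrix with a single global threshold would reach the same overall sparsity but could leave some PEs with many more nonzeros than others; ranking within each PE's share trades a little pruning freedom for equal workloads.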
Citation impact
- FWCI: 70.75
- Percentile: 100%
- References: 21
Authors
12
Topics & keywords
- Computer science
- Pruning
- Speedup
- Computation
- Quantization (signal processing)
- Codebook
- Parallel computing
- Schedule
- Affordable and clean energy