ESE

Han, Song; Kang, Junlong; Mao, Huizi; Hu, Yiming; Li, Xin; Li, Yubin; Xie, Dongliang; Luo, Hong; Yao, Song; Wang, Yu; Yang, Huazhong; Dally, William J.

doi:10.1145/3020078.3021745

Article · Feb 2, 2017 · Closed access

Song Han · Junlong Kang · Huizi Mao · Yiming Hu · Xin Li

Stanford University · Tsinghua University · +1 more institution

Indexed in Crossref

Abstract

Long Short-Term Memory (LSTM) is widely used in speech recognition. To achieve higher prediction accuracy, machine learning scientists have built increasingly large models. Such large models are both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. To speed up prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that…
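The abstract names load-balance-aware pruning but does not spell out the procedure here. As a minimal sketch of one plausible reading, the Python below assumes that matrix rows are interleaved across processing elements (PEs) and that each PE's share is pruned by magnitude to the same number of nonzeros, so no PE gets more work than another. The function names, the `r % num_pes` row assignment, and the `keep_ratio` parameter are all hypothetical and are not taken from the paper. A `keep_ratio` of 0.1 mirrors the 10x pruning factor; the 20x total in the abstract is the product of 10x from pruning and 2x from quantization.

```python
import random


def load_balanced_prune(weights, num_pes, keep_ratio):
    """Magnitude-prune a dense matrix so every PE keeps the same nonzero count.

    Assumption (hypothetical): row r is owned by PE number r % num_pes.
    """
    pruned = [row[:] for row in weights]
    for pe in range(num_pes):
        # Gather (magnitude, row, col) for every weight this PE owns.
        owned = [
            (abs(pruned[r][c]), r, c)
            for r in range(pe, len(pruned), num_pes)
            for c in range(len(pruned[r]))
        ]
        keep = max(1, round(len(owned) * keep_ratio))
        owned.sort(reverse=True)
        # Zero everything except the `keep` largest-magnitude weights.
        for _, r, c in owned[keep:]:
            pruned[r][c] = 0.0
    return pruned


def nonzeros_per_pe(weights, num_pes):
    """Count the surviving nonzero weights assigned to each PE."""
    counts = [0] * num_pes
    for r, row in enumerate(weights):
        counts[r % num_pes] += sum(1 for w in row if w != 0.0)
    return counts


random.seed(0)
dense = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(8)]
sparse = load_balanced_prune(dense, num_pes=4, keep_ratio=0.1)
counts = nonzeros_per_pe(sparse, num_pes=4)
print(counts)
```

With 8 rows of 20 weights spread over 4 PEs, each PE owns 40 weights and keeps 4 of them, so the per-PE counts come out equal. Plain global magnitude pruning gives no such guarantee, which is the imbalance this kind of scheme is meant to avoid.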

Citation impact

Total citations: 595

FWCI: 70.75
Percentile: 100%
References: 21

Authors

12

Topics & keywords

Keywords

Computer science
Pruning
Speedup
Computation
Quantization (signal processing)
Codebook
Parallel computing
Schedule

UN Sustainable Development Goals

Affordable and clean energy

Funding

National Natural Science Foundation of China