Article · Feb 2, 2017 · Closed access

ESE

Stanford University · Tsinghua University · +1 more institution

Indexed in Crossref

Abstract

Long Short-Term Memory (LSTM) is widely used in speech recognition. To achieve higher prediction accuracy, machine learning scientists have built increasingly large models. Such large models are both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) for a data center. To speed up prediction and make it energy-efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly to parallel processing. Next, we propose a scheduler that…
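
The abstract names load-balance-aware pruning but does not describe it in the text shown here. The sketch below is one possible reading, not the authors' implementation: it assumes the rows of a weight matrix are interleaved across a hypothetical number of processing elements (`num_pe`), and that each element's share is magnitude-pruned to the same number of nonzeros so that no element is left with more work than the others. The function names, matrix size, and `keep_ratio` value are illustrative assumptions. The last lines only restate the abstract's arithmetic (10x from pruning times 2x from quantization).

```python
import random


def prune_load_balanced(weights, num_pe, keep_ratio):
    """Magnitude-prune each PE's share of rows to the same nonzero count.

    Assumption (not from the abstract): row r is owned by PE r % num_pe.
    """
    pruned = [[0.0] * len(row) for row in weights]
    for pe in range(num_pe):
        # Gather (|w|, row, col) for every weight owned by this PE.
        owned = [(abs(w), r, c)
                 for r in range(pe, len(weights), num_pe)
                 for c, w in enumerate(weights[r])]
        keep = round(len(owned) * keep_ratio)
        # Keep only the largest-magnitude weights of this PE's share.
        owned.sort(key=lambda t: t[0], reverse=True)
        for _, r, c in owned[:keep]:
            pruned[r][c] = weights[r][c]
    return pruned


def nonzeros_per_pe(weights, num_pe):
    """Count surviving weights per PE under the same interleaved assignment."""
    return [sum(1 for r in range(pe, len(weights), num_pe)
                for w in weights[r] if w != 0.0)
            for pe in range(num_pe)]


random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

# keep_ratio=0.1 mirrors the abstract's 10x reduction from pruning.
P = prune_load_balanced(W, num_pe=4, keep_ratio=0.1)
counts = nonzeros_per_pe(P, num_pe=4)

# Compression arithmetic as stated in the abstract.
PRUNING_FACTOR = 10
QUANTIZATION_FACTOR = 2
total_compression = PRUNING_FACTOR * QUANTIZATION_FACTOR
```

Under these assumptions every processing element ends up with the same number of surviving weights, which is the property that would keep parallel workers evenly loaded.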

Citation impact

Total citations: 595
FWCI: 70.75
Percentile: 100%
References: 21

Authors

12

Topics & keywords

Keywords
  • Computer science
  • Pruning
  • Speedup
  • Computation
  • Quantization (signal processing)
  • Codebook
  • Parallel computing
  • Schedule
UN Sustainable Development Goals
  • Affordable and clean energy
