HAQ: Hardware-Aware Automated Quantization With Mixed Precision
Moscow Institute of Thermal Technology · Massachusetts Institute of Technology
Abstract
Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators are beginning to support mixed precision (1-8 bits) to further improve computation efficiency, which makes finding the optimal bitwidth for each layer a major challenge: domain experts must explore a vast design space, trading off accuracy, latency, energy, and model size, a process that is both time-consuming and sub-optimal. There is plenty of specialized hardware for neural networks, but little research has addressed neural network optimization specialized for a particular hardware architecture. Conventional quantization algorithms ignore the different…
Citation impact
- FWCI: 51.34
- Percentile: 100%
- References: 57
Authors
- Kuan Wang (corresponding author)
Moscow Institute of Thermal Technology, Massachusetts Institute of Technology
- Zhijian Liu
Moscow Institute of Thermal Technology, Massachusetts Institute of Technology
- Yujun Lin
Massachusetts Institute of Technology, Moscow Institute of Thermal Technology
- Ji Lin
Massachusetts Institute of Technology, Moscow Institute of Thermal Technology
- Song Han
Moscow Institute of Thermal Technology, Massachusetts Institute of Technology
Topics & keywords
- Computer science
- Quantization (signal processing)
- Hardware acceleration
- Edge device
- Computation
- Artificial neural network
- Computer hardware
- Design space exploration
- Affordable and clean energy