Preprint · Jun 1, 2019 · Closed access

HAQ: Hardware-Aware Automated Quantization With Mixed Precision

Moscow Institute of Thermal Technology · Massachusetts Institute of Technology

Indexed in Crossref

Abstract

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators have begun to support mixed precision (1-8 bits) to further improve computation efficiency, which raises a major challenge: finding the optimal bitwidth for each layer. This requires domain experts to explore a vast design space, trading off accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. There is plenty of specialized hardware for neural networks, but little research has been done on neural network optimization specialized for a particular hardware architecture. Conventional quantization algorithms ignore the different…


Funding