AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
Indexed in Crossref
Abstract
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce cloud computing costs and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. To solve these issues, we propose Activation-aware Weight Quantization (AWQ) and TinyChat, an algorithm-system full-stack solution for efficient on-device LLM deployment. AWQ is a novel quantization method that identifies and protects salient weights based on activation distribution, significantly reducing model size while preserving performance. TinyChat, an optimized inference…
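The abstract states that AWQ protects salient weights based on the activation distribution, but it does not spell out the mechanism. The following is a minimal sketch of one way to do that, not the authors' implementation. It assumes per-input-channel scaling: weights on channels with large average activation magnitude are scaled up before round-to-nearest quantization and scaled back afterwards, with the scaling exponent `alpha` chosen by a small grid search on calibration data. All function names, the toy data, and the symmetric per-row 4-bit quantizer are hypothetical simplifications. TinyChat is not covered.

```python
import random


def quantize_row(row, n_bits=4):
    """Symmetric round-to-nearest quantization of one weight row (assumed scheme)."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    step = max_abs / qmax
    return [round(w / step) * step for w in row]


def activation_scales(X, alpha):
    """Per-input-channel scale: (mean |activation|) ** alpha; alpha = 0 gives all ones."""
    n_in = len(X[0])
    scales = []
    for j in range(n_in):
        mean_abs = sum(abs(x[j]) for x in X) / len(X)
        scales.append(max(mean_abs, 1e-8) ** alpha)
    return scales


def quantize_with_scales(W, scales, n_bits=4):
    """Scale each input channel up, quantize, then fold the scale back out.

    Dividing the quantized weights by the scale is mathematically the same as
    dividing the matching activation channel by it at inference time.
    """
    W_eff = []
    for row in W:
        scaled = [w * s for w, s in zip(row, scales)]
        quantized = quantize_row(scaled, n_bits)
        W_eff.append([q / s for q, s in zip(quantized, scales)])
    return W_eff


def output_mse(W, W_eff, X):
    """Mean squared difference between full-precision and quantized layer outputs."""
    total = 0.0
    for x in X:
        for row, row_eff in zip(W, W_eff):
            y = sum(w * xi for w, xi in zip(row, x))
            y_eff = sum(w * xi for w, xi in zip(row_eff, x))
            total += (y - y_eff) ** 2
    return total / (len(X) * len(W))


def search_alpha(W, X, n_bits=4, grid=11):
    """Grid-search alpha in [0, 1]; alpha = 0 reproduces plain round-to-nearest."""
    best_alpha, best_err = 0.0, float("inf")
    for i in range(grid):
        alpha = i / (grid - 1)
        W_eff = quantize_with_scales(W, activation_scales(X, alpha), n_bits)
        err = output_mse(W, W_eff, X)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


def make_toy_data(seed=0, n_out=8, n_in=16, n_samples=32):
    """Random weights and calibration inputs; the first two channels are 'salient'."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    X = [
        [rng.gauss(0.0, 1.0) * (20.0 if j < 2 else 1.0) for j in range(n_in)]
        for _ in range(n_samples)
    ]
    return W, X


if __name__ == "__main__":
    W, X = make_toy_data()
    plain = quantize_with_scales(W, [1.0] * len(W[0]))
    alpha, err = search_alpha(W, X)
    print("round-to-nearest output MSE:", output_mse(W, plain, X))
    print("best alpha:", alpha, "activation-aware output MSE:", err)
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration data. How much it improves depends on how skewed the activation magnitudes are.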
Citation impact
- Total citations: 164
- FWCI: 157.04
- Percentile: 100%
- References: 13
Authors: 6
Topics & keywords
Keywords
- Quantization (signal processing)
- Acceleration
- Compression (physics)
- Computer science
- Materials science
- Composite material
- Physics
- Computer vision