AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration

Lin, Ji; Tang, Jiaming; Tang, Haotian; Yang, Shang; Xiao, Guangxuan; Han, Song

doi:10.1145/3714983.3714987

Article | GetMobile: Mobile Computing and Communications | Jan 20, 2025 | Closed access

IIT@MIT

Indexed in Crossref

Abstract

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce cloud computing costs and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. To solve these issues, we propose Activation-aware Weight Quantization (AWQ) and TinyChat, an algorithm-system full-stack solution for efficient on-device LLM deployment. AWQ is a novel quantization method that identifies and protects salient weights based on activation distribution, significantly reducing model size while preserving performance. TinyChat, an optimized inference…
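The abstract's central idea, protecting salient weights according to the activation distribution, can be illustrated with a small NumPy sketch. This is not the authors' implementation and has nothing to do with TinyChat's kernels. It assumes that "protecting" means scaling each input channel of a weight matrix by a power of its mean activation magnitude before round-to-nearest quantization, with the exponent chosen by grid search on calibration data. The function names (`quantize_rtn`, `output_mse`, `awq_style_quantize`), the 4-bit width, the group size, and the scale normalization are all assumptions made for illustration.

```python
import numpy as np

def quantize_rtn(w, n_bits=4, group_size=32):
    """Round-to-nearest min/max quantization per group of input channels.

    Returns the dequantized (float) weights so the error can be measured.
    Assumes the input dimension is divisible by group_size.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    qmax = 2 ** n_bits - 1
    step = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((g - w_min) / step), 0, qmax)
    return (q * step + w_min).reshape(out_f, in_f)

def output_mse(w_ref, w_hat, x):
    """Mean squared error between layer outputs x @ w.T for two weight sets."""
    return float(np.mean((x @ w_ref.T - x @ w_hat.T) ** 2))

def awq_style_quantize(w, x_calib, n_bits=4, group_size=32, n_grid=11):
    """Activation-aware scaling sketch (hypothetical helper, not the paper's code).

    Input channels with large average activation magnitude are scaled up
    before quantization, then the scale is folded back out, so their
    relative rounding error shrinks. alpha = 0 reproduces plain RTN.
    """
    act_mag = np.maximum(np.abs(x_calib).mean(axis=0), 1e-8)
    best_w, best_alpha, best_err = None, None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # assumed normalization to keep scales centered
        w_hat = quantize_rtn(w * s, n_bits, group_size) / s
        err = output_mse(w, w_hat, x_calib)
        if err < best_err:
            best_w, best_alpha, best_err = w_hat, float(alpha), err
    return best_w, best_alpha
```

Because the grid includes `alpha = 0`, the selected weights can never have a higher output error on the calibration data than plain round-to-nearest quantization. How much they improve depends on how uneven the activation magnitudes are across channels.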

Citation impact

Total citations: 164
FWCI: 157.04
Percentile: 100%
References: 13

Authors: 6

Topics & keywords

Keywords

Quantization (signal processing)
Acceleration
Compression (physics)
Computer science
Materials science
Composite material
Physics
Computer vision
