Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, Woosuk; Li, Z.; Zhuang, Siyuan; Sheng, Ying; Zheng, L; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion

doi:10.1145/3600006.3613165

Article · Oct 3, 2023 · Gold OA

Berkeley College · University of California, Berkeley · +2 more institutions

Indexed in Crossref

Abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests…
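The paging idea in the abstract can be illustrated with a small sketch. This is not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, the block size of 4 tokens, and the copy-on-write rule for a shared, partially filled last block are all assumptions made for illustration. The sketch only tracks block bookkeeping (no tensors or attention math): each request holds a block table that maps logical KV blocks to fixed-size physical blocks, allocated on demand, so waste is limited to the unfilled tail of the last block, and reference counts let requests share blocks.

```python
BLOCK_SIZE = 4  # tokens per KV block; hypothetical value for illustration


class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Another sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Per-request block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Assumed copy-on-write: take a private block before writing
                # (a real system would also copy the KV data).
                self.block_table[-1] = self.allocator.allocate()
                self.allocator.release(last)
        self.num_tokens += 1

    def fork(self):
        # Share every existing block instead of duplicating it.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0

    def wasted_slots(self):
        # Only the unfilled tail of the last block is unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens
```

In this sketch, a sequence of 6 tokens occupies two blocks and leaves 2 slots unused, and forking it consumes no new blocks until the child writes into the shared last block.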

Citation impact

Total citations: 1,009

FWCI: 166.80
Percentile: 100%
References: 19

Authors: 9

Topics & keywords

Keywords

Computer science
Paging
Cache
Demand paging
Parallel computing
Memory management
Virtual memory
Cache coloring
