Efficient Memory Management for Large Language Model Serving with PagedAttention
Berkeley College · University of California, Berkeley · +2 more institutions
Abstract
High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests…
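The paging analogy in the abstract can be made concrete with a small sketch. The Python below is an illustrative toy, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, no attention is computed, and the copy-on-write step is an assumption about how sharing could be kept safe. It only shows how fixed-size blocks plus a per-request block table limit waste to at most one partially filled block per request, and how reference counts let requests share blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; hypothetical value chosen for illustration


class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """One request; block_table maps logical block index -> physical block id."""

    allocator: BlockAllocator
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Assumed copy-on-write: take a private block before writing.
                # A real system would also copy the KV data into it.
                new_block = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Share every existing block with the child instead of duplicating it.
        for block in self.block_table:
            self.allocator.share(block)
        return Sequence(self.allocator, list(self.block_table), self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8)
    parent = Sequence(alloc)
    for _ in range(6):
        parent.append_token()
    child = parent.fork()
    child.append_token()
    print(parent.block_table, child.block_table, len(alloc.free))
```

In this toy, a request never reserves memory for tokens it has not yet produced, and a forked request reuses its parent's full blocks, which mirrors the two properties the abstract claims for vLLM.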
Citation impact
- FWCI: 166.80
- Percentile: 100%
- References: 19
Authors (9)
- Woosuk Kwon (corresponding author)
Berkeley College, University of California, Berkeley
- Z. Li
Berkeley College, University of California, Berkeley
- Siyuan Zhuang
Berkeley College, University of California, Berkeley
- Ying Sheng
University of California, Berkeley, Stanford University
- L. Zheng
Berkeley College, University of California, Berkeley
Topics & keywords
- Computer science
- Paging
- Cache
- Demand paging
- Parallel computing
- Memory management
- Virtual memory
- Cache coloring