Splitwise: Efficient Generative LLM Inference Using Phase Splitting
University of Washington · Microsoft Research (United Kingdom)
Abstract
Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost. Based on these insights, we propose Splitwise, a model…
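As a rough illustration of why the two phases in the abstract behave so differently, the sketch below estimates arithmetic intensity (FLOPs per byte of weights read) for a single dense layer. It is a back-of-the-envelope model and not the paper's methodology: the function name `arithmetic_intensity`, the 1024-token prompt, the 2-byte (fp16) weights, and a batch size of one are all assumptions made for illustration, and activation and KV-cache traffic are ignored.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read for one dense layer.

    Hypothetical helper: ignores activation and KV-cache memory traffic.
    """
    # A d_in x d_out matmul over `tokens_per_pass` tokens costs about
    # 2 * tokens * d_in * d_out FLOPs and reads d_in * d_out * bytes_per_param
    # bytes of weights, so the layer dimensions cancel out of the ratio.
    return 2 * tokens_per_pass / bytes_per_param


# Prompt computation: all prompt tokens go through the model in one pass.
prompt_intensity = arithmetic_intensity(tokens_per_pass=1024)

# Token generation: each pass produces a single new token per request.
decode_intensity = arithmetic_intensity(tokens_per_pass=1)

print(f"prompt phase: {prompt_intensity:.0f} FLOPs/byte")
print(f"token phase:  {decode_intensity:.0f} FLOPs/byte")
```

Under these assumptions the prompt phase reuses each weight across many tokens, while token generation re-reads the weights for every single token, which is consistent with the abstract's description of one phase as compute-intensive and the other as memory-intensive.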
Citation impact
- FWCI: 46.65
- Percentile: 100%
- References: 78
Authors
- 7
Topics & keywords
- Inference
- Generative grammar
- Computer science
- Phase (matter)
- Artificial intelligence
- Physics