Article · Jun 29, 2024 · Closed access

Splitwise: Efficient Generative LLM Inference Using Phase Splitting

University of Washington · Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can run at lower power and cost. Based on these insights, we propose Splitwise, a model…
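The contrast between the two phases can be illustrated with a rough roofline-style estimate. The sketch below is not taken from the paper: the function name `arithmetic_intensity`, the hidden size of 4096, the 1024-token prompt, and the 16-bit value size are all illustrative assumptions, and it models only a single square weight matrix multiply for one request, ignoring attention, KV-cache traffic, and batching.

```python
def arithmetic_intensity(tokens: int, hidden: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (tokens x hidden) @ (hidden x hidden) matmul.

    Hypothetical helper for illustration only; not part of Splitwise.
    """
    # Each output element needs `hidden` multiply-adds (2 FLOPs each).
    flops = 2 * tokens * hidden * hidden
    # The weight matrix is read once regardless of how many tokens are processed.
    weight_bytes = hidden * hidden * bytes_per_value
    # Activations: read the input rows and write the output rows.
    activation_bytes = 2 * tokens * hidden * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


HIDDEN = 4096  # assumed model width

# Prompt computation: all prompt tokens are processed in one pass.
prompt_ai = arithmetic_intensity(tokens=1024, hidden=HIDDEN)

# Token generation: one new token per step for a single request.
decode_ai = arithmetic_intensity(tokens=1, hidden=HIDDEN)

print(f"prompt phase:     {prompt_ai:.1f} FLOPs/byte")
print(f"generation phase: {decode_ai:.3f} FLOPs/byte")
```

Under these assumptions the prompt pass performs several hundred FLOPs per byte moved, while a single generation step performs about one, because the same weights must be read to produce only one token. This is consistent with the abstract's description of prompt computation as compute-intensive and token generation as memory-intensive.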
