Splitwise: Efficient Generative LLM Inference Using Phase Splitting

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Shah, Aashaka; Goiri, Íñigo; Maleki, Saeed; Bianchini, Ricardo

doi:10.1109/isca59077.2024.00019

Article · Jun 29, 2024 · Closed access

University of Washington · Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost. Based on these insights, we propose Splitwise, a model…

Citation impact

Total citations: 156

FWCI: 46.65
Percentile: 100%
References: 78

Authors: 7

Topics & keywords

Keywords

Inference
Generative grammar
Computer science
Phase (matter)
Artificial intelligence
Physics