AI and Memory Wall

Gholami, Amir; Yao, Zhewei; Kim, Sehoon; Hooper, Coleman; Mahoney, Michael W.; Keutzer, Kurt

doi:10.1109/mm.2024.3373763

Article · IEEE Micro · Mar 25, 2024 · Closed access

Berkeley College · University of California, Berkeley · +1 more institution

Indexed in Crossref

Abstract

The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0× every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6× and 1.4× every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the…
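To make the gap between these rates concrete, the following is a minimal sketch that compounds the three growth factors quoted in the abstract over the stated 20-year window. It assumes each rate holds constant for every 2-year period, which is a simplification; the function and variable names are illustrative and do not come from the paper.

```python
def cumulative_growth(factor_per_period: float, years: float,
                      period_years: float = 2.0) -> float:
    """Compound a per-period growth factor over a span of years."""
    return factor_per_period ** (years / period_years)


# Growth factors per 2 years, as quoted in the abstract.
RATES = {
    "peak FLOPS": 3.0,
    "DRAM bandwidth": 1.6,
    "interconnect bandwidth": 1.4,
}
YEARS = 20  # the window named in the abstract

# Assumption: each rate is constant across all ten 2-year periods.
growth = {name: cumulative_growth(rate, YEARS) for name, rate in RATES.items()}

for name, value in growth.items():
    print(f"{name}: ~{value:,.0f}x over {YEARS} years")

# How far compute has pulled ahead of each bandwidth figure.
for name in ("DRAM bandwidth", "interconnect bandwidth"):
    gap = growth["peak FLOPS"] / growth[name]
    print(f"peak FLOPS vs {name}: ~{gap:,.0f}x")
```

Under this constant-rate assumption, compute grows by roughly 59,000×, DRAM bandwidth by roughly 110×, and interconnect bandwidth by roughly 29×, so the compounded gap between compute and DRAM bandwidth is on the order of 500×.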

Citation impact

Total citations: 236

FWCI: 45.10
Percentile: 100%
References: 18

Authors: 6

Topics & keywords

Topics

Ferroelectric and Negative Capacitance Devices (25%)

Keywords

Computer science
Computer architecture
Parallel computing