QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers, Tim; Pagnoni, Artidoro; Holtzman, Ari; Zettlemoyer, Luke

doi:10.48550/arxiv.2305.14314

Preprint | arXiv (Cornell University) | May 23, 2023 | Green OA

Indexed in: arXiv, DataCite

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically…
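
As a rough illustration of the ideas in the abstract, the Python sketch below does two things: it estimates the memory needed for the weights alone of a 65B-parameter model at 16 and 4 bits per parameter, and it combines a frozen, quantized weight matrix with trainable low-rank adapters. It is a minimal sketch under stated assumptions, not the authors' implementation. The uniform 16-level absmax quantizer is a stand-in for NF4 (whose definition is not given in this record), and the function names `weight_memory_gb`, `quantize_block`, `dequantize_block`, and `lora_forward`, as well as the `scaling` parameter, are hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def quantize_block(block, levels=16):
    """Absmax-quantize a block of floats to integer codes in [0, levels - 1].

    Assumption: a uniform grid on [-1, 1] is used here as a stand-in for NF4.
    """
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    step = 2.0 / (levels - 1)
    codes = [round((w / scale + 1.0) / step) for w in block]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Map integer codes back to approximate float weights."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * scale for c in codes]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w_deq, a, b, x, scaling=1.0):
    """Frozen base output plus a low-rank update: W x + scaling * B (A x).

    w_deq is the dequantized frozen weight (d_out x d_in); only the adapters
    a (r x d_in) and b (d_out x r) would receive gradient updates in training.
    """
    base = matvec(w_deq, x)
    update = matvec(b, matvec(a, x))
    return [y + scaling * u for y, u in zip(base, update)]


# Weights-only footprint of a 65B-parameter model.
print(weight_memory_gb(65e9, 16))  # 130.0
print(weight_memory_gb(65e9, 4))   # 32.5
```

The weights-only estimate ignores quantization constants, adapter parameters, activations, and optimizer state, so it is a lower bound on actual memory use. In the adapter sketch, initializing `b` to zeros makes the initial output equal to that of the frozen quantized layer.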

Citation impact

Total citations: 493

FWCI: —
Percentile: —
References: 0

Authors: 4

Topics & keywords

Keywords

Computer science
Benchmark (surveying)
Memory footprint
Quantization (signal processing)
Language model
Artificial intelligence
Algorithm
Programming language

Funding

University of Washington