Preprint · arXiv (Cornell University) · May 23, 2023 · Green OA

QLoRA: Efficient Finetuning of Quantized LLMs

Indexed in: arXiv, DataCite

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically…
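The core idea in the abstract, a frozen 4-bit base weight with gradients flowing only into low-rank adapters, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes a uniform 16-level blockwise absmax quantizer in place of NF4, omits the other memory-saving techniques the abstract alludes to, and uses hypothetical names and sizes (`BLOCK`, `quantize_4bit`, `train_step`, a 2x4 weight, rank 2) chosen only for readability.

```python
BLOCK = 4  # hypothetical block size, kept tiny for readability


def quantize_4bit(weights, block=BLOCK):
    """Blockwise absmax quantization to 16 uniform levels (assumption: NOT NF4)."""
    codes, scales = [], []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        scale = max(abs(w) for w in chunk) or 1.0
        scales.append(scale)
        # Map [-scale, scale] onto integer codes 0..15.
        codes.extend(round((w / scale + 1.0) * 7.5) for w in chunk)
    return codes, scales


def dequantize_4bit(codes, scales, block=BLOCK):
    """Recover approximate float weights from 4-bit codes and per-block scales."""
    return [(c / 7.5 - 1.0) * scales[i // block] for i, c in enumerate(codes)]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def train_step(codes, scales, A, B, x, target, d_in, scaling, lr):
    """One SGD step on the adapters A and B; the quantized codes are only read."""
    flat = dequantize_4bit(codes, scales)
    W = [flat[i:i + d_in] for i in range(0, len(flat), d_in)]
    h = matvec(A, x)  # low-rank projection A x
    y = [b + scaling * d for b, d in zip(matvec(W, x), matvec(B, h))]
    g = [yi - ti for yi, ti in zip(y, target)]  # dL/dy for L = 0.5 * ||y - t||^2
    # Gradient w.r.t. h uses B from before the update.
    gh = [scaling * sum(B[i][k] * g[i] for i in range(len(B)))
          for k in range(len(A))]
    for i in range(len(B)):
        for k in range(len(A)):
            B[i][k] -= lr * scaling * g[i] * h[k]
    for k in range(len(A)):
        for j in range(d_in):
            A[k][j] -= lr * gh[k] * x[j]
    return 0.5 * sum(v * v for v in g)  # loss measured before this update


# Toy setup: a 2x4 base weight stored row-major, rank-2 adapters.
d_in, rank, alpha = 4, 2, 4.0
scaling = alpha / rank
W_full = [0.5, -0.25, 0.1, 0.3, -0.4, 0.2, 0.05, -0.1]
codes, scales = quantize_4bit(W_full)
frozen_codes = list(codes)  # snapshot to confirm the base stays frozen

A = [[0.1, 0.2, -0.1, 0.05], [-0.2, 0.1, 0.3, 0.1]]
B = [[0.0, 0.0], [0.0, 0.0]]  # B starts at zero, as is conventional for LoRA
x = [1.0, -0.5, 0.25, 2.0]
target = [1.0, -1.0]

loss_before = train_step(codes, scales, A, B, x, target, d_in, scaling, lr=1.0)
loss_after = train_step(codes, scales, A, B, x, target, d_in, scaling, lr=1.0)
```

In this sketch the loss drops between the two steps while `codes` never changes, which mirrors the division of labor the abstract describes: the quantized base model is storage only, and all learning happens in the adapters.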

Citation impact

493 total citations
FWCI
Percentile
References
0
Citations per year

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Benchmark (surveying)
  • Memory footprint
  • Quantization (signal processing)
  • Language model
  • Artificial intelligence
  • Algorithm
  • Programming language

Funding