QLoRA: Efficient Finetuning of Quantized LLMs
Indexed inarxivdatacite
Abstract
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically…
Citation impact
493
total citations
- FWCI
- —
- Percentile
- —
- References
- 0
Citations per year
Authors
4Topics & keywords
Topics
Keywords
- Computer science
- Benchmark (surveying)
- Memory footprint
- Quantization (signal processing)
- Language model
- Artificial intelligence
- Algorithm
- Programming language
No related works found for this paper.