Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea

doi:10.48550/arxiv.2305.18290

Preprint, arXiv (Cornell University), May 29, 2023, Green OA

Indexed in: arXiv, DataCite

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting…
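
The reward-model fitting step that the abstract attributes to RLHF can be illustrated with a small sketch. The abstract does not say which objective is used; the pairwise logistic (Bradley-Terry style) loss below is an assumption, chosen because it is a common way to fit scalar rewards to "response A preferred over response B" labels. The function name and its arguments are hypothetical, and the rewards are plain floats standing in for a learned model's outputs.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred.

    Assumes a Bradley-Terry style model (an assumption, not stated in the
    abstract): P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the sign
    # of the margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one labeled preference pair.
agrees_with_label = pairwise_preference_loss(2.0, 0.0)
contradicts_label = pairwise_preference_loss(0.0, 2.0)
print(agrees_with_label < contradicts_label)
```

Under this assumed objective, the loss shrinks as the reward model scores the human-preferred response further above the rejected one, so minimizing it over a preference dataset pushes the reward model toward the labels.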

Citation impact

Total citations: 274
References: 0

Authors: 6

Keywords

Automatic summarization
Hyperparameter
Computer science
Reinforcement learning
Artificial intelligence
Machine learning
Preference
Preference learning
