Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott G.; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack A.; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared

doi:10.48550/arxiv.2204.05862

Preprint | arXiv (Cornell University) | Apr 12, 2022 | Green OA

Indexed in: arXiv, DataCite

Abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy…
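The abstract reports a roughly linear relation between the RL reward and the square root of a KL divergence. As a minimal sketch of how such a relation could be checked, the snippet below fits reward against sqrt(KL) by ordinary least squares. The function name `fit_reward_vs_sqrt_kl` is hypothetical and the data points are synthetic, generated only to illustrate the fit; none of this is taken from the paper.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Least-squares fit of reward ~ intercept + slope * sqrt(KL).

    Hypothetical helper for illustration; not the authors' code.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Closed-form simple linear regression on the transformed x-axis.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (assumption): rewards built to be exactly linear in sqrt(KL).
kl_values = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
rewards = [0.5 + 0.3 * math.sqrt(kl) for kl in kl_values]

slope, intercept = fit_reward_vs_sqrt_kl(kl_values, rewards)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With real training logs, one would substitute measured KL values and rewards for the synthetic lists and inspect the residuals to judge how close to linear the relation actually is.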

Citation impact

Total citations: 364

FWCI: —
Percentile: —
References: 0

Authors: 31

Topics & keywords

Keywords

Computer science
Reinforcement learning
Initialization
Robustness (evolution)
Automatic summarization
Artificial intelligence
Machine learning
Python (programming language)

UN Sustainable Development Goals

Quality Education
