Preprint | arXiv (Cornell University) | Apr 12, 2022 | Green OA

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Indexed in: arXiv, DataCite

Abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy…
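The relation mentioned at the end of the abstract, an RL reward that is roughly linear in the square root of a KL divergence, can be checked on logged training data with an ordinary least-squares fit. The sketch below is only an illustration under stated assumptions: the function name `fit_reward_vs_sqrt_kl` is hypothetical, the data points are synthetic and made up for the example, and nothing here is taken from the paper's code or results.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Least-squares fit of reward ~ intercept + slope * sqrt(KL).

    Hypothetical helper for illustration; not from the paper.
    """
    if len(kl_values) != len(rewards) or len(kl_values) < 2:
        raise ValueError("need at least two (KL, reward) pairs of equal length")
    if any(kl < 0 for kl in kl_values):
        raise ValueError("KL divergence values must be non-negative")

    # Regress on sqrt(KL) rather than KL itself.
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all KL values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, made-up checkpoints (NOT measurements from the paper):
# rewards were generated as 0.5 + 0.5 * sqrt(KL).
kl = [0.0, 1.0, 4.0, 9.0, 16.0]
reward = [0.5, 1.0, 1.5, 2.0, 2.5]
slope, intercept = fit_reward_vs_sqrt_kl(kl, reward)
```

On real logs the fit would not be exact, so it is worth inspecting the residuals as well: a roughly linear relation shows up as residuals with no systematic curvature across the range of sqrt(KL).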

Citation impact

Total citations: 364
References: 0

Authors

31

Topics & keywords

Keywords
  • Computer science
  • Reinforcement learning
  • Initialization
  • Robustness (evolution)
  • Automatic summarization
  • Artificial intelligence
  • Machine learning
  • Python (programming language)
UN Sustainable Development Goals
  • Quality Education