Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Indexed in: arXiv, DataCite
Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy…
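As an illustration of the reward relation mentioned in the abstract, the following minimal sketch fits a straight line to reward against the square root of a KL value. It is not taken from the paper: the function name `fit_reward_vs_sqrt_kl` is hypothetical, and the numbers are synthetic values generated from an assumed slope and intercept purely to exercise the fit.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Ordinary least-squares fit of reward = slope * sqrt(KL) + intercept.

    Hypothetical helper for illustration; not the paper's code.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Closed-form simple linear regression on the transformed x-axis.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, made-up data (NOT measurements from the paper):
# rewards are generated from an assumed slope of 0.3 and intercept of 0.5.
kl = [0.0, 1.0, 4.0, 9.0, 16.0]
reward = [0.5 + 0.3 * math.sqrt(k) for k in kl]

slope, intercept = fit_reward_vs_sqrt_kl(kl, reward)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With real training logs, a roughly linear relation would show up as small residuals around this fitted line; plotting reward against raw KL instead would look concave.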
Citation impact
364 total citations
- FWCI: not available
- Percentile: not available
- References: 0
Authors
31
Topics & keywords
Topics
Keywords
- Computer science
- Reinforcement learning
- Initialization
- Robustness (evolution)
- Automatic summarization
- Artificial intelligence
- Machine learning
- Python (programming language)
UN Sustainable Development Goals
- Quality Education