RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Chaudhari, Shreyas; Aggarwal, Pranjal; Murahari, Vishvak; Rajpurohit, Tanmay; Kalyan, Ashwin; Narasimhan, Karthik; Deshpande, Ameet; Silva, Bruno Castro da

doi:10.1145/3743127

Review · ACM Computing Surveys · Jun 5, 2025 · Hybrid OA

University of Massachusetts Amherst · Carnegie Mellon University · +3 more institutions

Indexed in Crossref

Abstract

A significant challenge in training large language models (LLMs) as effective assistants is aligning them with human preferences. Reinforcement learning from human feedback (RLHF) has emerged as a promising solution. However, our understanding of RLHF is often limited to initial design choices. This article analyzes RLHF through reinforcement learning principles, focusing on the reward model. It examines modeling choices and function approximation caveats, highlighting assumptions about reward expressivity and revealing limitations like incorrect generalization, model misspecification, and sparse feedback. A categorical review of current literature provides insights for researchers to understand the challenges…

Citation impact

Total citations: 43

FWCI: 81.95
Percentile: 100%
References: 34

Authors: 8

Topics & keywords

Keywords

Computer science
Reinforcement learning
Human–computer interaction
Artificial intelligence