RLHF (Reinforcement Learning from Human Feedback)
Also known as: RLHF / Reinforcement Learning from Human Feedback / 人間のフィードバックからの強化学習
A training paradigm in which human raters compare model outputs, a reward model is trained on those preferences, and the LLM is then optimized via RL to match human intent — the technique that made ChatGPT conversationally useful.
Overview
RLHF aligns a pre-trained LLM with human intentions. Human annotators compare model outputs pairwise, and a reward model (RM) is trained on these preference pairs to predict which response a human would prefer. The LLM is then fine-tuned with PPO (Proximal Policy Optimization) to maximize the RM's score, typically with a KL-divergence penalty against the original model so the policy does not drift too far from its pre-trained behavior, producing responses that are more helpful and less harmful.
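A minimal PyTorch sketch of the two objectives this pipeline uses: the pairwise (Bradley-Terry) loss for the reward model, and the KL-penalized reward handed to PPO. The function names, tensor shapes, and the kl_coef value are illustrative assumptions, not the API of any particular RLHF library.

    import torch
    import torch.nn.functional as F

    # Reward-model objective: given scalar scores for the preferred ("chosen")
    # and rejected completion of the same prompt, maximize the log-probability
    # that the chosen one is ranked higher (Bradley-Terry pairwise loss).
    def reward_model_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    # RL objective: the policy is rewarded by the RM score, minus a KL penalty
    # that keeps the fine-tuned policy close to the frozen reference model.
    def shaped_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
        kl = policy_logprobs - ref_logprobs       # per-token KL estimate
        return rm_score - kl_coef * kl.sum(-1)    # scalar reward fed to PPO

    # Toy usage with random numbers standing in for model outputs.
    chosen, rejected = torch.randn(8), torch.randn(8)
    print(reward_model_loss(chosen, rejected))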
Limitations and successors
Human preference annotation is expensive, and PPO-based training is complex and can be unstable. DPO (Direct Preference Optimization) addresses both issues by optimizing the policy directly on the preference data, with no separate reward model and no RL loop, and has become a widely used alternative to PPO-based RLHF in recent alignment work.
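For comparison, a minimal sketch of the DPO loss under the same assumptions: the inputs are sequence-level log-probabilities of the chosen and rejected completions under the policy and a frozen reference model, and all names and the beta value are illustrative.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit "reward" of each completion: how much more likely the policy
        # makes it than the frozen reference model, scaled by beta.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # The same pairwise preference loss used for the reward model, applied
        # directly to the policy -- no separate RM and no PPO loop.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Toy usage with random log-probabilities standing in for real model outputs.
    lp = lambda: torch.randn(8)
    print(dpo_loss(lp(), lp(), lp(), lp()))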