株式会社オブライト
AI 2026-05-17

RLHF (Reinforcement Learning from Human Feedback)

Also known as: RLHF / Reinforcement Learning from Human Feedback / 人間のフィードバックからの強化学習

A training paradigm in which human raters compare model outputs, a reward model is trained on those preferences, and the LLM is then optimized via RL to match human intent — the technique that made ChatGPT conversationally useful.


Overview

RLHF aligns a pre-trained LLM with human intentions. Human annotators compare model outputs pairwise; a reward model (RM) is trained on these preference pairs. The LLM is then fine-tuned with PPO (Proximal Policy Optimization) to maximize the RM's score, typically under a KL-divergence penalty that keeps the policy close to the original model, producing more helpful and less harmful responses.
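The reward-model step described above is usually trained with a Bradley-Terry pairwise loss: the RM should score the human-preferred ("chosen") response higher than the rejected one. A minimal sketch of that loss in plain Python (the function name and scalar inputs are illustrative; in practice the scores come from a neural reward model over whole responses):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    The reward model is trained to rank the chosen response above the
    rejected one: loss = -log(sigmoid(r_chosen - r_rejected)).
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss is log(2) when the scores tie, and shrinks as the
# reward model separates chosen from rejected more confidently.
```

Averaging this loss over a dataset of annotated pairs yields the RM whose score PPO then maximizes.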

Limitations and successors

Human annotation is expensive and PPO training is unstable. DPO (Direct Preference Optimization) addresses both issues by learning directly from preference data without a separate reward model, and has largely supplanted RLHF in recent alignment work.
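DPO collapses the two-stage pipeline into a single loss on the policy's own log-probabilities, using the reference (pre-fine-tuning) model in place of an explicit reward model. A minimal sketch, assuming per-response summed log-probabilities are already computed (function name, scalar inputs, and the default beta are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The implicit reward of a response is beta * (policy logprob -
    reference logprob); the loss pushes the chosen response's implicit
    reward above the rejected one's, with no separate reward model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is zero and the
# loss is log(2); it drops as the policy favors the chosen response.
```

Because the loss is a simple supervised objective over preference pairs, it avoids PPO's rollout sampling and reward-model queries, which is the stability and cost advantage noted above.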
