株式会社オブライト
AI · 2026-05-17

DPO (Direct Preference Optimization)

Also known as: DPO / Direct Preference Optimization / 直接選好最適化

An alignment method that optimizes an LLM directly on human preference pairs without training a separate reward model, offering simpler implementation and more stable training than RLHF.


Overview

Proposed by Stanford researchers in 2023, DPO skips both reward-model training and PPO, updating the LLM's parameters directly from (preferred, rejected) response pairs. Under certain assumptions it can be shown to recover the same optimal policy as RLHF, while being far simpler to implement and more stable to train.
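The per-pair objective behind this can be sketched in a few lines. A minimal illustration, assuming the summed token log-probabilities of each response under the policy and a frozen reference model have already been computed (the function name and inputs here are illustrative, not from any particular library):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    Each argument is the total log-probability of a response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: beta * log-ratio vs. the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin): pushed down as the preferred response
    # becomes more likely than the rejected one relative to the reference.
    margin = chosen_reward - rejected_reward
    return math.log(1.0 + math.exp(-margin))
```

When the policy equals the reference the margin is zero and the loss is log 2; widening the gap in favor of the preferred response drives the loss toward zero, which is why no separate reward model is needed.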

Implementation

Libraries such as TRL implement DPO in a few dozen lines of user code. It is widely used to align open models such as Llama and Qwen on custom preference data for enterprise use cases.
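In practice, most of the work is preparing preference pairs in the column layout TRL's DPOTrainer consumes. A sketch (the example texts are invented; the commented trainer call follows the TRL API, which varies between versions):

```python
# One training example pairs a preferred and a rejected response to the
# same prompt. These column names are the format TRL's DPOTrainer expects.
preference_pair = {
    "prompt": "Summarize the report in one sentence.",
    "chosen": "Revenue grew 12% year over year, driven by the cloud segment.",
    "rejected": "I don't know.",
}

# With a dataset of such rows, training is roughly (TRL API; versions vary):
#   from trl import DPOTrainer, DPOConfig
#   trainer = DPOTrainer(
#       model=model,                # the LLM to align
#       ref_model=None,             # TRL can clone the frozen reference
#       args=DPOConfig(beta=0.1, output_dir="dpo-out"),
#       train_dataset=dataset,
#       processing_class=tokenizer,
#   )
#   trainer.train()
```

Setting ref_model=None lets TRL create the frozen reference copy internally, so only the preference dataset and the beta temperature need to be supplied.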
