Training from pairwise preferences without a reward model
Given prompts x and pairwise preferences (y+ preferred to y−), DPO (Direct Preference Optimization) trains the policy π_θ directly to favor y+ over y−, measured relative to a frozen reference policy π_ref, without fitting an explicit reward model.
The coefficient β controls how sharply preference violations are penalized; it plays the role of the KL-constraint strength toward π_ref in the underlying RL objective. The reference log-ratios anchor the updates, keeping π_θ close to the base model and preventing degenerate solutions such as collapsing all probability mass onto the preferred responses.
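Written out, this is the standard DPO objective, where σ is the logistic sigmoid (in LaTeX notation):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\Big[\log \sigma\Big(\beta\Big(\log\tfrac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} - \log\tfrac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\Big)\Big)\Big]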
freeze π_ref; initialize π_θ
for minibatch of (x, y_plus, y_minus):
lpos = log π_θ(y_plus|x); lneg = log π_θ(y_minus|x)
lpos_ref = log π_ref(y_plus|x); lneg_ref = log π_ref(y_minus|x)
z = β * ((lpos - lneg) - (lpos_ref - lneg_ref))   # β scales the full margin of policy-vs-reference log-ratios
loss = -mean(log_sigmoid(z))
update θ by minimizing loss
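For concreteness, here is a minimal PyTorch sketch of the loop above. It assumes a HuggingFace-style causal LM whose output exposes `.logits` and batches of tokenized prompt+response pairs with prompt positions masked to -100; the helper `sequence_logprob`, the batch field names, and `beta=0.1` are illustrative assumptions, not part of the original description.

import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels, attention_mask):
    """Sum of token log-probs of `labels` under `model`.
    Prompt positions are assumed to be masked with -100 in `labels`,
    so only the response tokens contribute."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # shift so that the token at position t is predicted from positions < t
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    mask = labels != -100
    safe_labels = labels.masked_fill(~mask, 0)
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        -1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

def dpo_loss(policy_model, ref_model, batch, beta=0.1):
    # log π_θ(y+|x) and log π_θ(y-|x)
    lpos = sequence_logprob(policy_model, batch["pos_ids"], batch["pos_labels"], batch["pos_mask"])
    lneg = sequence_logprob(policy_model, batch["neg_ids"], batch["neg_labels"], batch["neg_mask"])
    with torch.no_grad():  # π_ref stays frozen
        lpos_ref = sequence_logprob(ref_model, batch["pos_ids"], batch["pos_labels"], batch["pos_mask"])
        lneg_ref = sequence_logprob(ref_model, batch["neg_ids"], batch["neg_labels"], batch["neg_mask"])
    # β scales the whole margin of log-ratios, matching the corrected pseudocode
    z = beta * ((lpos - lpos_ref) - (lneg - lneg_ref))
    return -F.logsigmoid(z).mean()

The reference log-probs are computed under torch.no_grad() since π_ref receives no updates; only the policy's log-probs carry gradients back to θ.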