DPO (Direct Preference Optimization)

Training from pairwise preferences without a reward model

Setup

Given prompts x and pairwise preferences (y+ preferred over y−), DPO optimizes the policy πθ directly to prefer y+ over y− relative to a frozen reference policy πref, with no explicit reward model.
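For concreteness, a single training example can be stored as a prompt plus a preferred and a rejected completion. The field names below match the pseudocode later in this section and are illustrative, not fixed by DPO:

preference_example = {
    "x": "Explain gradient descent in one sentence.",
    "y_plus": "Gradient descent iteratively moves parameters against the gradient of the loss.",
    "y_minus": "Gradient descent is when the model descends gradients, which are gradients.",
}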

Objective

L(θ) = − E[ log σ( β ( (log πθ(y+|x) − log πθ(y−|x)) − (log πref(y+|x) − log πref(y−|x)) ) ) ]

β controls how strongly preference margins are rewarded; equivalently, it sets the strength of the implicit KL constraint toward πref. The reference terms anchor updates, preventing degenerate solutions and keeping πθ close to the base model.
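A minimal sketch of this objective in PyTorch, assuming the per-sequence log-probabilities (summed over completion tokens) have already been computed; the function name and the β = 0.1 default are illustrative choices, not prescribed here:

import torch
import torch.nn.functional as F

def dpo_loss(
    logp_plus: torch.Tensor,      # log πθ(y+|x), shape (batch,)
    logp_minus: torch.Tensor,     # log πθ(y−|x), shape (batch,)
    ref_logp_plus: torch.Tensor,  # log πref(y+|x), shape (batch,)
    ref_logp_minus: torch.Tensor, # log πref(y−|x), shape (batch,)
    beta: float = 0.1,            # common default; tune per task
) -> torch.Tensor:
    # Difference of log-ratio margins between policy and reference, scaled by beta.
    policy_margin = logp_plus - logp_minus
    ref_margin = ref_logp_plus - ref_logp_minus
    z = beta * (policy_margin - ref_margin)
    # Negative log-sigmoid of the margin, averaged over the batch.
    return -F.logsigmoid(z).mean()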

Diagram

Prompt x → Policy πθ(y|x) and Reference πref(y|x) → Preference loss
The policy is trained to favor preferred completions over rejected ones relative to the reference.

Pseudocode

freeze π_ref; initialize π_θ
for minibatch of (x, y_plus, y_minus):
  lpos = log π_θ(y_plus|x); lneg = log π_θ(y_minus|x)
  lpos_ref = log π_ref(y_plus|x); lneg_ref = log π_ref(y_minus|x)
  z = β * ((lpos - lneg) - (lpos_ref - lneg_ref))
  loss = -mean(log_sigmoid(z))
  update θ by minimizing loss
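The following is a runnable PyTorch sketch of the same loop. The TinyLM stand-in model, the sequence_logprob helper, the optimizer settings, batch shapes, and β = 0.1 are all assumptions made for illustration; only the loss computation mirrors the pseudocode above. In practice the log-probabilities come from a real causal language model and a tokenized preference dataset.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def sequence_logprob(model, input_ids, completion_mask):
    """Sum of per-token log-probs over completion positions only.

    input_ids: (batch, seq_len) token ids for prompt + completion.
    completion_mask: (batch, seq_len), 1.0 where the token belongs to the
    completion; prompt and padding positions are 0.0 and do not contribute.
    """
    logits = model(input_ids)                          # (batch, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)   # predict token t+1 from prefix
    targets = input_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logprobs * completion_mask[:, 1:]).sum(dim=-1)

class TinyLM(nn.Module):
    """Toy stand-in; any model returning (batch, seq, vocab) logits works here."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.embed(ids))

# One DPO update on random toy data (shapes only; real batches come from a preference dataset).
policy = TinyLM()
ref = copy.deepcopy(policy)
for p in ref.parameters():
    p.requires_grad_(False)                            # freeze π_ref
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
beta = 0.1

ids_plus = torch.randint(0, 100, (4, 16))              # prompt + preferred completion
ids_minus = torch.randint(0, 100, (4, 16))             # prompt + rejected completion
mask = torch.ones(4, 16)
mask[:, :8] = 0.0                                      # first 8 positions = prompt

z = beta * (
    (sequence_logprob(policy, ids_plus, mask) - sequence_logprob(policy, ids_minus, mask))
    - (sequence_logprob(ref, ids_plus, mask) - sequence_logprob(ref, ids_minus, mask))
)
loss = -F.logsigmoid(z).mean()
opt.zero_grad()
loss.backward()
opt.step()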

Notes