Training from pairwise preferences without a reward model
Given prompts x and pairwise preferences (y+ preferred to y−), DPO (Direct Preference Optimization) trains the policy π_θ directly to favor y+ over y−, measured relative to a frozen reference policy π_ref, without fitting an explicit reward model.
The coefficient β controls how sharply preference violations are penalized; it plays the role of the KL-constraint strength toward π_ref in the underlying RL objective. The reference log-ratios anchor the updates, keeping π_θ close to the base model and preventing degenerate solutions such as collapsing all probability mass onto the preferred responses.
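Written out, this is the standard DPO objective, where σ is the logistic sigmoid (in LaTeX notation):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\Big[\log \sigma\Big(\beta\Big(\log\tfrac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} - \log\tfrac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\Big)\Big)\Big]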
freeze π_ref; initialize π_θ
for minibatch of (x, y_plus, y_minus):
lpos = log π_θ(y_plus|x); lneg = log π_θ(y_minus|x)
lpos_ref = log π_ref(y_plus|x); lneg_ref = log π_ref(y_minus|x)
z = β * ((lpos - lneg) - (lpos_ref - lneg_ref))   # β scales the full margin of policy-vs-reference log-ratios
loss = -mean(log_sigmoid(z))
update θ by minimizing loss
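For concreteness, here is a minimal PyTorch sketch of the loop above. It assumes a HuggingFace-style causal LM whose output exposes `.logits` and batches of tokenized prompt+response pairs with prompt positions masked to -100; the helper `sequence_logprob`, the batch field names, and `beta=0.1` are illustrative assumptions, not part of the original description.

import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels, attention_mask):
    """Sum of token log-probs of `labels` under `model`.
    Prompt positions are assumed to be masked with -100 in `labels`,
    so only the response tokens contribute."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # shift so that the token at position t is predicted from positions < t
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    mask = labels != -100
    safe_labels = labels.masked_fill(~mask, 0)
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        -1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

def dpo_loss(policy_model, ref_model, batch, beta=0.1):
    # log π_θ(y+|x) and log π_θ(y-|x)
    lpos = sequence_logprob(policy_model, batch["pos_ids"], batch["pos_labels"], batch["pos_mask"])
    lneg = sequence_logprob(policy_model, batch["neg_ids"], batch["neg_labels"], batch["neg_mask"])
    with torch.no_grad():  # π_ref stays frozen
        lpos_ref = sequence_logprob(ref_model, batch["pos_ids"], batch["pos_labels"], batch["pos_mask"])
        lneg_ref = sequence_logprob(ref_model, batch["neg_ids"], batch["neg_labels"], batch["neg_mask"])
    # β scales the whole margin of log-ratios, matching the corrected pseudocode
    z = beta * ((lpos - lpos_ref) - (lneg - lneg_ref))
    return -F.logsigmoid(z).mean()

The reference log-probs are computed under torch.no_grad() since π_ref receives no updates; only the policy's log-probs carry gradients back to θ.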