Unifying preference learning with policy optimization
Preference‑based training aligns a model to human choices using paired or grouped responses. GRPO generalizes direct preference optimization by weighting multiple candidates and optionally constraining updates via KL regularization or clipping, combining DPO‑style preference losses with PPO‑style update stability.
Given a prompt x and a set of K responses {y_i} from a language model (policy) πθ(y|x), we want the policy to place higher probability mass on preferred responses. We denote a frozen reference policy πref (often the base SFT model) that anchors updates.
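The loop later in this section needs per‑sequence log‑probabilities under both policies. Here is a minimal sketch of that computation, assuming a Hugging Face‑style causal LM; the helper name sequence_logprob, the (prompt + response) token layout, and the prompt_len argument are illustrative choices, not a fixed API:

import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, prompt_len):
    # Sum of log pi(y_t | x, y_<t) over response tokens only; also return the token count.
    # prompt_len: (B,) LongTensor with the number of prompt tokens per example.
    logits = model(input_ids, attention_mask=attention_mask).logits   # (B, L, V)
    logps = F.log_softmax(logits[:, :-1], dim=-1)                     # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logp = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, L-1)
    # Keep only response positions (drop prompt tokens and padding).
    pos = torch.arange(targets.size(1), device=input_ids.device).unsqueeze(0)
    resp_mask = (pos >= (prompt_len - 1).unsqueeze(1)) & attention_mask[:, 1:].bool()
    return (token_logp * resp_mask).sum(-1), resp_mask.sum(-1)        # (B,), (B,)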
To avoid length bias, use a length‑normalized score s(y|x) = (1/T)·log πθ(y|x), where T is the number of response tokens, or add an explicit length penalty. We also define a scaled, reference‑relative score used by many preference losses: s_i = β·( (1/T_i)·log πθ(y_i|x) − (1/T_i)·log πref(y_i|x) ), with scale β > 0.
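A minimal sketch of this score, reusing the illustrative sequence_logprob helper above; β = 0.1 is an arbitrary default, not a recommendation:

def relative_scores(policy, ref_policy, input_ids, attention_mask, prompt_len, beta=0.1):
    # Length-normalized log-probs under the current policy and the frozen reference.
    logp, n_tok = sequence_logprob(policy, input_ids, attention_mask, prompt_len)
    with torch.no_grad():
        logp_ref, _ = sequence_logprob(ref_policy, input_ids, attention_mask, prompt_len)
    n = n_tok.clamp(min=1)
    return beta * (logp / n - logp_ref / n), logp, logp_ref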
Weights w_i can be derived from preference labels (e.g., +1/−1), Bradley‑Terry scores, or a learned reward model. A trust‑region flavor can be added by clipping probability ratios, as in PPO.
Different preference assumptions induce different weights (see the sketch after this list):
- Pairwise labels (DPO‑style pairs): w_i = +1 for the chosen response and −1 for the rejected one.
- Listwise Bradley‑Terry: w_i is a softmax over the reference‑relative scores s_i within the group.
- Learned reward model: w_i proportional to the reward r(x, y_i), e.g. centered by the group‑mean reward for an advantage‑style weight.
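A minimal sketch of these weighting schemes; the function names are illustrative, with listwise_softmax matching the name used in the training loop below:

def pairwise_weights(chosen_mask):
    # chosen_mask: (B, K) bool, True for the preferred response in each pair; gives +1 / -1 weights.
    return chosen_mask.float() * 2.0 - 1.0

def listwise_softmax(S):
    # S: (B, K) reference-relative scores; Bradley-Terry style softmax within each group of K.
    return torch.softmax(S, dim=-1)

def reward_weights(rewards):
    # rewards: (B, K) from a learned reward model; center by the group mean (advantage-style).
    return rewards - rewards.mean(dim=-1, keepdim=True)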
Define the GRPO objective (minimization form) with KL regularization: L(θ) = −E_x[ (1/K)·Σ_i w_i·log πθ(y_i|x) ] + λ·KL(πθ || πref). With the optional clipping surrogate, the weighted log‑likelihood term is replaced by the PPO‑style conservative term E_x[ (1/K)·Σ_i min( r_i·w_i, clip(r_i, 1−ε, 1+ε)·w_i ) ], where r_i = πθ(y_i|x)/πref(y_i|x), and the loss is its negative plus the KL penalty.
Gradient (ignoring the dependence of w on θ for a simple, effective estimator): ∇θ L(θ) = −E_x[ (1/K)·Σ_i w_i·∇θ log πθ(y_i|x) ] + λ·∇θ KL(πθ || πref).
Detaching w_i (stop‑gradient) avoids second‑order terms and is standard in practice.
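A minimal sketch of this estimator; the sample‑based KL term (logp − logp_ref averaged over responses drawn from πθ) is one common approximation rather than the only option, and λ = 0.05 is arbitrary:

def grpo_loss(logp, logp_ref, w, lam=0.05):
    # logp, logp_ref: (B, K) sequence log-probs; w: (B, K) preference weights.
    kl = (logp - logp_ref).mean()              # Monte Carlo estimate of KL(pi_theta || pi_ref)
    return -(w.detach() * logp).mean() + lam * kl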
Summed sequence log‑probabilities favor shorter outputs, since every additional token contributes another negative log‑probability term. Common fixes (a small demonstration follows):
- Length‑normalize: divide the sequence log‑probability by the number of response tokens, as in s(y|x) above.
- Explicit length penalty: add a term to the score that offsets the advantage of shorter responses.
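A tiny demonstration of the bias and the normalized fix; the per‑token log‑probabilities are made up for the example:

import torch

short = torch.full((3,), -1.0)   # 3-token response: sum = -3.0, mean = -1.0
long = torch.full((9,), -0.9)    # 9-token response: sum = -8.1, mean = -0.9
print(short.sum() > long.sum())    # True: summed log-probs prefer the shorter response
print(short.mean() > long.mean())  # False: length-normalized scores prefer the better long one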
Clipping limits harmful updates when πθ drifts from πref: once the probability ratio r = πθ(y|x)/πref(y|x) leaves the [1−ε, 1+ε] band, the surrogate min(r·w, clip(r, 1−ε, 1+ε)·w) stops crediting further movement in the favored direction. The plot shows this conservative surrogate as a function of r.
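A minimal sketch of that surrogate, which can also be used to regenerate the plot; ε = 0.2 is only an illustrative value:

def clipped_surrogate(r, w, eps=0.2):
    # r: probability ratio pi_theta / pi_ref; w: preference weight (positive or negative).
    return torch.minimum(r * w, torch.clamp(r, 1 - eps, 1 + eps) * w)

r = torch.linspace(0.0, 2.0, steps=9)
print(clipped_surrogate(r, torch.tensor(1.0)))   # flattens once r exceeds 1 + eps
print(clipped_surrogate(r, torch.tensor(-1.0)))  # flattens once r drops below 1 - eps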
Putting the pieces together, a PyTorch‑style sketch of the training loop (the sampling, log‑prob, and normalization helpers are left abstract, as in the original outline):

import torch
from torch.nn.utils import clip_grad_norm_

# Inputs: prompts x, frozen reference policy pi_ref, current policy pi_theta, K candidates per prompt
for batch in dataloader:
    # 1) Sample K responses per prompt (top-p, temperature) or read them from a buffer
    Y = sample_candidates(model=pi_theta, prompts=batch.x, K=K)

    # 2) Sequence log-probs under pi_theta and pi_ref (padding masked out)
    logp = logprob(pi_theta, batch.x, Y)                 # carries gradients
    with torch.no_grad():
        logp_ref = logprob(pi_ref, batch.x, Y)           # reference is frozen

    # 3) Relative scores and weights (length-normalized); detach w to avoid second-order terms
    S = beta * (normalize(logp) - normalize(logp_ref))
    w = listwise_softmax(S).detach()                     # or pairwise / heuristic weights

    # 4) Loss: clipping surrogate or weighted log-likelihood, each with a KL penalty
    kl = (logp - logp_ref).mean()                        # sample-based KL(pi_theta || pi_ref) estimate
    if use_clipping:
        r = torch.exp(logp - logp_ref)                   # probability ratio pi_theta / pi_ref
        obj = torch.minimum(r * w, torch.clamp(r, 1 - eps, 1 + eps) * w).mean() - lam * kl
        loss = -obj
    else:
        loss = -(w * logp).mean() + lam * kl

    # 5) Update theta
    loss.backward()
    clip_grad_norm_(pi_theta.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()
The same procedure, as compact pseudocode:

freeze π_ref
for each minibatch of prompts x:
    sample K candidates {y_i} ~ current π_θ (or read from a buffer)
    compute weights w_i from preferences or scores (stop-gradient through w_i)
    if using clipping:
        r_i = π_θ(y_i|x) / π_ref(y_i|x)
        obj = mean_i( min(r_i·w_i, clip(r_i, 1−ε, 1+ε)·w_i) ) − λ·KL(π_θ || π_ref)
    else:
        obj = mean_i( w_i · log π_θ(y_i|x) ) − λ·KL(π_θ || π_ref)
    take a gradient ascent step on obj with respect to θ