PPO (Proximal Policy Optimization)

Clipped surrogate policy gradient with trust‑region flavor

Surrogate Objective

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
L_clip(θ) = E_t[ min( r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]

The clipped term prevents destructively large updates when the new policy deviates too far from the behavior policy that generated the data.
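For concreteness, here is a minimal PyTorch sketch of the clipped surrogate; the tensor names (logp_new, logp_old, adv) and the helper clipped_surrogate are illustrative, not taken from any particular implementation.

import torch

def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Element-wise minimum keeps the conservative bound; mean over the batch.
    return torch.min(unclipped, clipped).mean()

Because the surrogate is maximized, the training loss negates it, as in the combined loss below.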

Complete Loss

L(θ,ϕ) = E_t[ −L_clip(θ) + c_v·(V_ϕ(s_t) − V_targ)² − c_e·H[π_θ(·|s_t)] ]

Combine the clipped policy term, the value regression term, and an entropy bonus into a single loss; advantages Â_t are computed with GAE(λ).
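A possible sketch of the combined loss in PyTorch; the default coefficients c_v = 0.5 and c_e = 0.01 are common choices, not values prescribed above.

import torch
import torch.nn.functional as F

def ppo_loss(l_clip: torch.Tensor,
             value_pred: torch.Tensor,
             value_target: torch.Tensor,
             entropy: torch.Tensor,
             c_v: float = 0.5,
             c_e: float = 0.01) -> torch.Tensor:
    # Negate the surrogate (we minimize), penalize value error, reward entropy.
    value_loss = F.mse_loss(value_pred, value_target)
    return -l_clip + c_v * value_loss - c_e * entropy.mean()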

Diagram

π_θ vs π_{θ_old} → ratio r → clip(r, 1−ε, 1+ε) → min(r·Â, clip(r)·Â) → stabilized surrogate objective
Compute the probability ratio, clip it, then take the conservative surrogate.

Pseudocode

for iteration in range(K):
  collect trajectories with π_{θ_old}
  compute advantages Â via GAE and normalize
  for epoch in range(E):
    for minibatch in data:
      r = π_θ(a|s) / π_{θ_old}(a|s)
      L_clip = mean(min(r*Â, clip(r,1-ε,1+ε)*Â))
      L_value = mse(V_ϕ(s), V_target)
      loss = -(L_clip - c_v*L_value + c_e*entropy)
      update θ, ϕ by gradient step
  θ_old ← θ


Derivation & GAE

Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l},  where δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

GAE reduces the variance of the advantage estimate, with λ acting as a bias-variance knob (λ=0 gives the one-step TD residual, λ=1 recovers the Monte Carlo return minus the baseline). PPO then maximizes a clipped surrogate of the importance-weighted advantage, preventing destructive updates when the ratio r = π_θ/π_{θ_old} drifts far from 1.
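One way to compute the recursion above over a finite rollout is a backward scan; the array names (rewards, values, dones) are illustrative, and values is assumed to include a bootstrap estimate for the final state.

import numpy as np

def gae_advantages(rewards: np.ndarray,   # shape [T]
                   values: np.ndarray,    # shape [T+1], last entry is the bootstrap V(s_T)
                   dones: np.ndarray,     # shape [T], 1.0 where an episode ended
                   gamma: float = 0.99,
                   lam: float = 0.95) -> np.ndarray:
    # Â_t = Σ_l (γλ)^l δ_{t+l}, evaluated as the reverse recursion Â_t = δ_t + γλ·Â_{t+1}.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual δ_t = r_t + γ·V(s_{t+1}) − V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv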


Diagram: Clipping Effect

Unclipped r·Â (dark) vs clipped min(r·Â, clip(r)·Â) (light)
Clipping flattens the surrogate near ratio bounds to enforce a trust region.
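To see the flattening numerically, the small sketch below evaluates both objectives over a grid of ratios for a positive advantage (Â = 1 and ε = 0.2 are illustrative values, not from the text):

import numpy as np

eps, adv = 0.2, 1.0  # illustrative values
for r in np.linspace(0.6, 1.6, 6):
    unclipped = r * adv
    clipped_obj = min(unclipped, float(np.clip(r, 1 - eps, 1 + eps)) * adv)
    # For Â > 0 the surrogate is capped at (1+ε)·Â once r exceeds 1+ε,
    # so its gradient with respect to r vanishes there.
    print(f"r={r:.1f}  unclipped={unclipped:.2f}  clipped surrogate={clipped_obj:.2f}")

With Â < 0 the picture flips: the surrogate flattens once r falls below 1−ε, again blocking further movement in that direction.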

Interactive: Clipping Explorer

Controls: ratio r (default 1.000), clip range ε (the trust region), and the sign of the advantage Â. Readouts: the unclipped objective r·Â and the conservative clipped objective min(r·Â, clip(r)·Â).

Stability Notes