PPO (Proximal Policy Optimization)

Clipped surrogate policy gradient with trust‑region flavor

Surrogate Objective

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
L_clip(θ) = E_t[ min( r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]

The clipped term prevents destructively large updates when the new policy deviates too far from the behavior policy that generated the data.
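For concreteness, here is a minimal PyTorch sketch of the clipped surrogate; the tensor names (logp_new, logp_old, adv) and the helper clipped_surrogate are illustrative, not taken from any particular implementation.

import torch

def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Element-wise minimum keeps the conservative bound; mean over the batch.
    return torch.min(unclipped, clipped).mean()

Because the surrogate is maximized, the training loss negates it, as in the combined loss below.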

Complete Loss

L(θ,ϕ) = E_t[ −L_clip(θ) + c_v·(V_ϕ(s_t) − V_targ)² − c_e·H[π_θ(·|s_t)] ]

Combine the clipped policy term, the value regression term, and an entropy bonus into a single loss; advantages Â_t are computed with GAE(λ).
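A possible sketch of the combined loss in PyTorch; the default coefficients c_v = 0.5 and c_e = 0.01 are common choices, not values prescribed above.

import torch
import torch.nn.functional as F

def ppo_loss(l_clip: torch.Tensor,
             value_pred: torch.Tensor,
             value_target: torch.Tensor,
             entropy: torch.Tensor,
             c_v: float = 0.5,
             c_e: float = 0.01) -> torch.Tensor:
    # Negate the surrogate (we minimize), penalize value error, reward entropy.
    value_loss = F.mse_loss(value_pred, value_target)
    return -l_clip + c_v * value_loss - c_e * entropy.mean()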

Diagram

π_θ vs π_{θ_old} → ratio r → clip(r, 1−ε, 1+ε) → min(r·Â, clip(r)·Â) → stabilized surrogate objective
Compute the probability ratio, clip it, then take the conservative surrogate.

Pseudocode

for iteration in range(K):
  collect trajectories with π_{θ_old}
  compute advantages Â via GAE and normalize
  for epoch in range(E):
    for minibatch in data:
      r = π_θ(a|s) / π_{θ_old}(a|s)
      L_clip = mean(min(r*Â, clip(r,1-ε,1+ε)*Â))
      L_value = mse(V_ϕ(s), V_target)
      loss = -(L_clip - c_v*L_value + c_e*entropy)
      update θ, ϕ by gradient step
  θ_old ← θ


Derivation & GAE

Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l},  where δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

GAE reduces the variance of the advantage estimate, with λ acting as a bias-variance knob (λ=0 gives the one-step TD residual, λ=1 recovers the Monte Carlo return minus the baseline). PPO then maximizes a clipped surrogate of the importance-weighted advantage, preventing destructive updates when the ratio r = π_θ/π_{θ_old} drifts far from 1.
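One way to compute the recursion above over a finite rollout is a backward scan; the array names (rewards, values, dones) are illustrative, and values is assumed to include a bootstrap estimate for the final state.

import numpy as np

def gae_advantages(rewards: np.ndarray,   # shape [T]
                   values: np.ndarray,    # shape [T+1], last entry is the bootstrap V(s_T)
                   dones: np.ndarray,     # shape [T], 1.0 where an episode ended
                   gamma: float = 0.99,
                   lam: float = 0.95) -> np.ndarray:
    # Â_t = Σ_l (γλ)^l δ_{t+l}, evaluated as the reverse recursion Â_t = δ_t + γλ·Â_{t+1}.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual δ_t = r_t + γ·V(s_{t+1}) − V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv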


Diagram: Clipping Effect

Unclipped r·Â (dark) vs clipped min(r·Â, clip(r)·Â) (light)
Clipping flattens the surrogate near ratio bounds to enforce a trust region.
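To see the flattening numerically, the small sketch below evaluates both objectives over a grid of ratios for a positive advantage (Â = 1 and ε = 0.2 are illustrative values, not from the text):

import numpy as np

eps, adv = 0.2, 1.0  # illustrative values
for r in np.linspace(0.6, 1.6, 6):
    unclipped = r * adv
    clipped_obj = min(unclipped, float(np.clip(r, 1 - eps, 1 + eps)) * adv)
    # For Â > 0 the surrogate is capped at (1+ε)·Â once r exceeds 1+ε,
    # so its gradient with respect to r vanishes there.
    print(f"r={r:.1f}  unclipped={unclipped:.2f}  clipped surrogate={clipped_obj:.2f}")

With Â < 0 the picture flips: the surrogate flattens once r falls below 1−ε, again blocking further movement in that direction.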

Interactive: Clipping Explorer

Controls: ratio r (default 1.000), clip range ε (the trust region), and the sign of the advantage Â. Readouts: the unclipped objective r·Â and the conservative clipped objective min(r·Â, clip(r)·Â).

Stability Notes