Combine the policy (clipped surrogate), value, and entropy terms into a single loss. Use GAE(λ) to compute advantages.
Diagram
Compute probability ratio, clip, then take the conservative surrogate.
Pseudocode
for iteration in range(K):
    collect trajectories with π_{θ_old}
    compute advantages Â via GAE(λ) and normalize
    for epoch in range(E):
        for minibatch in data:
            r = π_θ(a|s) / π_{θ_old}(a|s)
            L_clip = mean(min(r*Â, clip(r, 1-ε, 1+ε)*Â))
            L_value = mse(V_ϕ(s), V_target)
            loss = -(L_clip - c_v*L_value + c_e*entropy)
            update θ, ϕ by gradient step
    θ_old ← θ
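A minimal, runnable sketch of the minibatch loss above in PyTorch, assuming log-probabilities, advantages, value predictions, and entropies have already been gathered; the function name and default coefficients are illustrative, not taken from any particular library:

import torch
import torch.nn.functional as F

def ppo_loss(logp_new, logp_old, adv, v_pred, v_target, entropy,
             eps=0.2, c_v=0.5, c_e=0.01):
    # r = π_θ(a|s) / π_{θ_old}(a|s), computed in log space for numerical stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    l_clip = torch.min(unclipped, clipped).mean()   # conservative surrogate
    l_value = F.mse_loss(v_pred, v_target)          # critic regression loss
    # negate because optimizers minimize; the entropy bonus encourages exploration
    return -(l_clip - c_v * l_value + c_e * entropy.mean())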
Tips
Typical ε in [0.1, 0.3]; use 2–10 epochs with minibatches.
Normalize advantages; clip value loss to avoid value explosion.
Early stop epochs if KL(π_θ || π_{θ_old}) exceeds a threshold.
GAE reduces the variance of advantage estimates, with λ acting as a bias-variance knob. PPO then maximizes a clipped surrogate of the importance-sampled advantage, preventing destructive updates when the ratio r = π_θ/π_{θ_old} drifts far from 1.
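As a sketch of the GAE side, the recursion Â_t = δ_t + γλ·Â_{t+1} with δ_t = r_t + γV(s_{t+1}) − V(s_t) runs backwards over a rollout; the names and default constants below are assumptions for illustration:

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # `values` carries one extra bootstrap entry V(s_T) at the end
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + values[:T]                       # value targets for the critic
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # per-batch normalization
    return adv, returns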
Training Schedule
Collect T steps across N envs → batch B=N·T.
Compute Â with GAE(λ), normalize per batch; bootstrap with V at truncation.
Run E epochs over B with minibatch size M; shuffle each epoch.
Track approximate KL; early stop epochs if KL > KL_target (e.g., 0.02–0.1), as sketched below.
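A schedule skeleton under the assumptions above, with the batch stored as a dict of flat arrays of length B = N·T (keyed by names like "obs"); `update_fn` and `approx_kl_fn` are hypothetical callbacks standing in for the minibatch update and a KL estimate such as mean(logp_old − logp_new):

import numpy as np

def run_epochs(batch, update_fn, approx_kl_fn, num_epochs=4,
               minibatch_size=256, kl_target=0.02, seed=0):
    rng = np.random.default_rng(seed)
    B = len(batch["obs"])
    for epoch in range(num_epochs):
        idx = rng.permutation(B)                     # reshuffle every epoch
        for start in range(0, B, minibatch_size):
            mb = idx[start:start + minibatch_size]
            update_fn({k: v[mb] for k, v in batch.items()})
        if approx_kl_fn(batch) > kl_target:          # early stop on KL drift
            break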
Diagram: Clipping Effect
Clipping flattens the surrogate near ratio bounds to enforce a trust region.
Interactive: Clipping Explorer
(Interactive widget: sliders for the ratio r, ε, and the advantage sign compare the unclipped objective r·Â with the clipped, conservative objective min(r·Â, clip(r, 1-ε, 1+ε)·Â).)
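The explorer's computation reduces to a few lines of arithmetic; a plain-Python stand-in (hypothetical helper, illustrative numbers):

def surrogate(r, adv, eps=0.2):
    # per-sample unclipped and conservative (clipped) objectives
    unclipped = r * adv
    clipped = max(min(r, 1.0 + eps), 1.0 - eps) * adv
    return unclipped, min(unclipped, clipped)

# surrogate(1.5, +1.0) -> (1.5, 1.2): gains from a positive advantage are capped at 1+ε
# surrogate(1.5, -1.0) -> (-1.5, -1.5): the unclipped penalty is kept, so bad moves still hurt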
Stability Notes
Clip the value loss (bound |V − V_target| by c) to avoid overfitting the critic; a common variant is sketched below.
Anneal the entropy bonus over training; normalize observations; clip gradient norms to 0.5–1.0.
Use separate learning rates for the actor and critic when the critic loss dominates the total loss.
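A hedged PyTorch sketch of the stability notes above, using the common PPO-style value clip (clip V around the old prediction `v_old` recorded at collection time) as a stand-in for the bound mentioned; the constants are illustrative:

import torch

def clipped_value_loss(v_new, v_old, v_target, clip=0.2):
    # penalize the worse of the raw and clipped critic errors (PPO-style value clip)
    v_clipped = v_old + torch.clamp(v_new - v_old, -clip, clip)
    loss_raw = (v_new - v_target).pow(2)
    loss_clipped = (v_clipped - v_target).pow(2)
    return 0.5 * torch.max(loss_raw, loss_clipped).mean()

# after loss.backward(), cap the global gradient norm before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)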