TRPO (Trust Region Policy Optimization)

Policy improvement with KL‑constrained natural gradient

Constrained Optimization

maximize_θ  E_t[ r_t(θ) Â_t ]   subject to   E_t[ KL(π_θold || π_θ) ] ≤ δ

where r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) is the importance ratio and Â_t is an advantage estimate.

TRPO solves this trust‑region problem approximately: the objective is expanded to first order and the KL constraint to second order, which yields a step along the natural‑gradient direction that is computed with conjugate gradients.
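For concreteness, here is a minimal PyTorch sketch of the surrogate objective and mean KL for a categorical policy; the policy(obs)-returns-logits interface and the tensors obs, actions, advantages, old_logits are assumptions for illustration only.

import torch
from torch.distributions import Categorical, kl_divergence

def surrogate_and_kl(policy, obs, actions, advantages, old_logits):
    # Hypothetical interface: policy(obs) returns per-state action logits.
    dist = Categorical(logits=policy(obs))
    old_dist = Categorical(logits=old_logits)
    # Importance ratio r_t(θ) = π_θ(a|s) / π_θold(a|s), computed in log space.
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()          # E_t[ r_t(θ) Â_t ]
    mean_kl = kl_divergence(old_dist, dist).mean()   # E_t[ KL(π_θold || π_θ) ]
    return surrogate, mean_kl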

Update Step

θ' = θ + √( 2δ / (g^T H^{-1} g) ) · H^{-1} g,   g = ∇_θ L(θ),   H ≈ Fisher information matrix of π_θ

A backtracking line search along this direction then ensures that the KL constraint holds empirically on the batch and that the surrogate objective actually improves.
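A minimal sketch of the scaled step plus backtracking line search, assuming x ≈ H⁻¹g from conjugate gradients, Hx as a Fisher‑vector product of x, and a hypothetical eval_surrogate_and_kl(theta) closure that loads the flat parameter vector into the policy and returns the batch surrogate L and mean KL:

import torch

def line_search(eval_surrogate_and_kl, theta, x, Hx, delta,
                backtrack_coeff=0.5, max_backtracks=10):
    # Full step: β x with β = √(2δ / (xᵀ H x)); note xᵀ H x = gᵀ H⁻¹ g when x = H⁻¹ g.
    beta = torch.sqrt(2.0 * delta / (x @ Hx))
    full_step = beta * x
    old_surrogate, _ = eval_surrogate_and_kl(theta)
    for k in range(max_backtracks):
        step = (backtrack_coeff ** k) * full_step
        surrogate, kl = eval_surrogate_and_kl(theta + step)
        # Accept the first shrunken step that stays in the trust region and improves L.
        if kl <= delta and surrogate > old_surrogate:
            return theta + step
    return theta  # no acceptable step found; keep the current parameters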

Conjugate Gradient Details

We do not form H explicitly. Instead, we compute Fisher‑vector products Hv via automatic differentiation of the KL divergence (a double‑backward pass) and use CG to solve Hx = g. Typically 10–20 CG iterations suffice.
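As a concrete sketch of one way to implement such a Fisher‑vector product in PyTorch via double backpropagation (the helper name fisher_vector_product and the mean_kl_fn/params interface are assumptions for illustration):

import torch

def fisher_vector_product(mean_kl_fn, params, v, damping=1e-2):
    # H v = ∇_θ( (∇_θ KL)ᵀ v ): differentiate the KL once, dot with v, differentiate again.
    kl = mean_kl_fn()
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hvp = torch.autograd.grad(flat_grad @ v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v   # small damping keeps CG numerically well behaved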

import torch

# Conjugate gradient: solve H x = g using only Fisher-vector products fisher_vec(v) = H v.
# fisher_vec must compute the Hessian-vector product of the mean KL via double backprop
# (differentiate (∇_θ KL)·v); the first derivative of the KL at θ_old is zero.
def conjugate_gradient(fisher_vec, g, iters=10, tol=1e-10):
    x = torch.zeros_like(g)
    r = g.clone()              # residual g - H x, with x = 0
    p = r.clone()              # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = fisher_vec(p)
        alpha = rs / (p @ Ap + 1e-8)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if torch.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

Diagram: Trust Region

[Figure: feasible region KL ≤ δ around θ, with the updated point θ'.]
Step along the natural gradient, then line‑search to satisfy the KL constraint.

Practical Tips


Sketch (Full Update via Conjugate Gradient)

compute g = ∇_θ L(θ)
solve Hx = g approximately via conjugate gradients using Fisher‑vector products
scale step: step = √(2δ / (x^T H x)) * x
backtracking line search until KL(π_θ || π_{θ+step}) ≤ δ and the surrogate L improves
apply step to θ
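Putting these pieces together, a minimal end‑to‑end sketch under the same assumptions (a flat parameter vector theta, a differentiable eval_surrogate_and_kl closure, an fvp(v) = Hv callable such as fisher_vector_product above with its KL closure and parameters bound in, and the conjugate_gradient and line_search helpers sketched earlier):

import torch

def trpo_step(theta, eval_surrogate_and_kl, fvp, delta=0.01, cg_iters=10):
    # One TRPO update on the flat parameter vector theta (a sketch, not a full implementation).
    theta = theta.detach().requires_grad_(True)
    surrogate, _ = eval_surrogate_and_kl(theta)
    g = torch.autograd.grad(surrogate, theta)[0]              # g = ∇_θ L(θ)
    x = conjugate_gradient(fvp, g.detach(), iters=cg_iters)   # x ≈ H⁻¹ g
    return line_search(eval_surrogate_and_kl, theta.detach(), x, fvp(x), delta)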
