Maximum‑entropy RL for stability and exploration
Maximum‑entropy RL augments the return with a policy‑entropy bonus, maximizing E[ Σ_t r_t + α H(π_θ(·|s_t)) ]. The entropy term encourages diverse actions, improving exploration and robustness.
In practice, maintain two Q‑nets and take their minimum (to reduce value overestimation), and update a stochastic Gaussian policy by minimizing the KL surrogate E_{a~π_θ}[ α log π_θ(a|s) − min_i Q_i(s,a) ], which equals, up to a constant, the KL divergence between π_θ(·|s) and the Boltzmann distribution ∝ exp(Q(s,·)/α).
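A minimal sketch of the twin Q‑nets, assuming PyTorch; the QNet class, layer sizes, and the example dimensions are illustrative, not prescribed by the text.

import torch
import torch.nn as nn

class QNet(nn.Module):
    """One Q-network: an MLP over the concatenated (state, action)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

# Two independent critics; min(Q1, Q2) is used wherever Q appears below.
q1 = QNet(state_dim=17, action_dim=6)   # dims are placeholders
q2 = QNet(state_dim=17, action_dim=6)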
Policy gradients pass through stochastic actions using the reparameterization trick. A tanh‑squashed Gaussian ensures bounded actions; log‑prob must account for the tanh Jacobian.
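A minimal sketch of the squashed policy head, again assuming PyTorch; the SquashedGaussianHead name and clamp bounds are illustrative. rsample() draws a reparameterized sample so gradients flow through the action, and the log‑prob subtracts the tanh Jacobian term Σ log(1 − tanh(u)²).

import torch
import torch.nn as nn

class SquashedGaussianHead(nn.Module):
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mu = nn.Linear(feature_dim, action_dim)
        self.log_std = nn.Linear(feature_dim, action_dim)

    def forward(self, features: torch.Tensor):
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)   # keep std in a sane range
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                 # reparameterized: gradients flow through the sample
        a = torch.tanh(u)                  # squash to (-1, 1) for bounded actions
        # log pi(a|s) = log N(u) - sum log(1 - tanh(u)^2); small epsilon for numerical stability
        log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
        return a, log_prob.sum(dim=-1)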
Automatically tune α by minimizing J(α) = E[ −α ( log π_θ(a|s) + H_target ) ]. This drives the policy entropy toward the target H_target; for continuous actions, H_target ≈ −|A| (minus the action dimensionality) is a reasonable starting point.
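A minimal sketch of this temperature update, assuming PyTorch; optimizing log α (rather than α) keeps the temperature positive, and log π(a|s) is detached so only α receives gradients. The action dimension and learning rate are placeholders.

import torch

action_dim = 6                                   # illustrative
h_target = -float(action_dim)                    # H_target ≈ -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob: torch.Tensor) -> torch.Tensor:
    # J(alpha) = E[ -alpha * (log pi(a|s) + H_target) ]
    alpha_loss = -(log_alpha.exp() * (log_prob.detach() + h_target)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().detach()              # current alpha for the other losses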
initialize Q1, Q2, target networks, policy π_θ, temperature α
for each step:
sample action a ~ π_θ(·|s), step env → (s', r)
store (s,a,r,s') in replay
sample batch from replay
# Q update
a' ~ π_θ(·|s') ; y = r + γ (min_i Q̄_i(s',a') - α log π_θ(a'|s'))   # Q̄_i: target nets; no gradient through y
minimize Σ_i (Q_i(s,a) - y)^2
# Policy update
a ~ π_θ(·|s) (reparameterized) ; minimize α log π_θ(a|s) - min_i Q_i(s,a)
# Temperature update (optional)
minimize over α:  α * ( -log π_θ(a|s) - H_target )   # treat log π_θ(a|s) as constant
update target networks by Polyak averaging
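Putting the loop body together, a minimal sketch of one update step, assuming PyTorch; policy is assumed to return (action, log_prob) as in the head above, q_opt is assumed to optimize both critics' parameters, and the batch is assumed to carry a done flag for episode termination.

import torch

GAMMA, TAU = 0.99, 0.005   # discount and Polyak coefficient (placeholders)

def sac_update(batch, policy, q1, q2, q1_targ, q2_targ, q_opt, pi_opt, alpha):
    s, a, r, s_next, done = batch   # tensors; r and done have shape [batch]

    # Q update: bootstrap from the *target* critics, with the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + GAMMA * (1.0 - done) * (q_next - alpha * logp_next)
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy update: minimize alpha * log pi(a|s) - min_i Q_i(s,a)
    # with a freshly sampled (reparameterized) action.
    a_pi, logp_pi = policy(s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    pi_loss = (alpha * logp_pi - q_pi).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Target networks: Polyak averaging toward the live critics.
    with torch.no_grad():
        for targ, live in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p in zip(targ.parameters(), live.parameters()):
                p_t.mul_(1 - TAU).add_(TAU * p)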