SAC (Soft Actor‑Critic)

Maximum‑entropy RL for stability and exploration

Soft Objective

J(π) = Σ_t E_{(s_t,a_t)∼ρ_π}[ r(s_t,a_t) + α·H(π(·|s_t)) ]

The entropy term encourages diverse actions, improving exploration and robustness.
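
As a minimal sketch (assuming a fixed α and estimating H(π(·|s)) by −log π(a|s) at the sampled action), the soft objective for one rollout can be computed as below; the reward and log-prob values are hypothetical.

import torch

# Monte-Carlo estimate of the soft objective for one rollout;
# H(pi(.|s_t)) is estimated by -log pi(a_t|s_t) at the sampled action.
def soft_objective_estimate(rewards, log_probs, alpha=0.2):
    entropy_bonus = -log_probs                      # per-step entropy samples
    return (rewards + alpha * entropy_bonus).sum()

rewards = torch.tensor([1.0, 0.5, 0.2])             # hypothetical rollout rewards
log_probs = torch.tensor([-1.2, -0.9, -1.5])        # log pi(a_t|s_t)
print(soft_objective_estimate(rewards, log_probs))  # entropy-regularized return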

Soft Bellman Backup

Q(s,a) ← r + γ E_{s'∼P, a'∼π}[ Q(s',a') − α log π(a'|s') ]
π ← argmin_π E_s[ KL( π(·|s) || exp(Q(s,·)/α) / Z(s) ) ]

In practice, train two Q‑networks and use the smaller of their estimates in targets (to reduce overestimation), and update a stochastic Gaussian policy by minimizing this KL surrogate.
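
A minimal PyTorch sketch of the soft Bellman target, assuming twin target critics q1_targ, q2_targ and a policy callable returning (action, log_prob); these names and interfaces are assumptions, not a fixed API.

import torch

# Soft TD target: y = r + gamma*(1 - done)*(min_i Q_i_targ(s',a') - alpha*log pi(a'|s'))
def soft_td_target(reward, next_obs, done, policy, q1_targ, q2_targ,
                   gamma=0.99, alpha=0.2):
    with torch.no_grad():                              # no gradient through the target
        next_action, next_logp = policy(next_obs)      # a' ~ pi(.|s')
        q_next = torch.min(q1_targ(next_obs, next_action),
                           q2_targ(next_obs, next_action))
        return reward + gamma * (1.0 - done) * (q_next - alpha * next_logp)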

Reparameterization Trick

a = tanh(μ(s) + σ(s) ⊙ ε),  ε ∼ N(0, I)   (tanh squashing keeps actions bounded)

Policy gradients pass through stochastic actions using the reparameterization trick. A tanh‑squashed Gaussian ensures bounded actions; log‑prob must account for the tanh Jacobian.
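
A sketch of reparameterized sampling from a tanh-squashed Gaussian, assuming mu and log_std are the policy network's outputs; the 1e-6 constant is a common numerical-stability choice, not part of the algorithm.

import torch
from torch.distributions import Normal

def sample_action(mu, log_std):
    std = log_std.exp()
    eps = torch.randn_like(mu)            # eps ~ N(0, I)
    pre_tanh = mu + std * eps             # reparameterized Gaussian sample
    action = torch.tanh(pre_tanh)         # squash into (-1, 1)
    # log pi(a|s) = log N(u; mu, std) - sum_j log(1 - tanh(u_j)^2)  (tanh Jacobian)
    logp = Normal(mu, std).log_prob(pre_tanh).sum(-1)
    logp = logp - torch.log(1.0 - action.pow(2) + 1e-6).sum(-1)
    return action, logp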

Architecture Diagram

[Diagram: policy π(a|s) feeding twin critics Q1(s,a) and Q2(s,a) with Polyak‑averaged target networks; the min‑Q trick is combined with entropy regularization.]
Twin Q networks reduce overestimation; policy minimizes KL to exp(Q/α).
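
A minimal sketch of twin critics in PyTorch; the layer sizes are arbitrary choices, not prescribed by SAC. Both heads score the same (s, a) pair, and the smaller value is used when forming targets.

import torch
import torch.nn as nn

class TwinQ(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))
        self.q1, self.q2 = mlp(), mlp()    # two independent critics

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)  # critics condition on (s, a)
        return self.q1(x), self.q2(x)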

Automatic Temperature Tuning

J(α) = E_{a∼π}[ −α ( log π(a|s) + H_target ) ],  α ← α − η ∂J/∂α

This drives the policy entropy toward a target value. A common starting point for continuous actions is H_target = −dim(A), i.e. the negative of the action dimension.
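
A hedged sketch of the temperature update, parameterizing α through log_alpha so it stays positive; the learning rate and action dimension below are hypothetical.

import torch

act_dim = 6                                   # hypothetical action dimension
target_entropy = -float(act_dim)              # H_target = -dim(A) heuristic
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs):
    # log_probs: log pi(a|s) for the batch; detached so only alpha is updated
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()             # current alpha for the other losses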

Diagram

[Diagram: soft Q‑function Q(s,a) and stochastic policy π(a|s) coupled through the entropy term α·H.]
SAC alternates soft Q‑learning with a KL‑style policy update.

Pseudocode

initialize critics Q1, Q2, target critics Q1_targ, Q2_targ, policy π_θ, temperature α
for each environment step:
  sample action a ~ π_θ(·|s), step env → (s', r, done)
  store (s, a, r, s', done) in the replay buffer
  sample a minibatch from the replay buffer
  # Q update: the soft Bellman target uses the target critics
  a' ~ π_θ(·|s')
  y = r + γ (1 − done) ( min_i Q_i_targ(s', a') − α log π_θ(a'|s') )
  minimize Σ_i ( Q_i(s, a) − y )^2
  # Policy update: reparameterized sample ã so gradients flow through the action
  ã = tanh(μ_θ(s) + σ_θ(s) ⊙ ε), ε ~ N(0, I)
  minimize α log π_θ(ã|s) − min_i Q_i(s, ã)
  # Temperature update (optional): gradient flows only through α
  minimize α ( −log π_θ(ã|s) − H_target )
  # Target update
  Q_i_targ ← τ Q_i + (1 − τ) Q_i_targ   (Polyak averaging)
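
For reference, a hedged PyTorch sketch of one full update step. The network and optimizer objects (policy, q1, q2, q1_targ, q2_targ, q_opt, pi_opt) are assumed to exist with the interfaces used below, and policy(obs) is assumed to return a reparameterized (action, log_prob) pair; none of these names come from a specific library.

import torch

def sac_update(batch, policy, q1, q2, q1_targ, q2_targ,
               q_opt, pi_opt, alpha=0.2, gamma=0.99, tau=0.005):
    obs, act, rew, next_obs, done = batch

    # Critic update against the soft Bellman target (target critics, no grad).
    with torch.no_grad():
        next_act, next_logp = policy(next_obs)
        q_next = torch.min(q1_targ(next_obs, next_act),
                           q2_targ(next_obs, next_act)).squeeze(-1)
        y = rew + gamma * (1.0 - done) * (q_next - alpha * next_logp)
    q_loss = ((q1(obs, act).squeeze(-1) - y) ** 2).mean() \
             + ((q2(obs, act).squeeze(-1) - y) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Policy update: minimize alpha*log pi(a|s) - min_i Q_i(s, a) using a fresh
    # reparameterized action (stray critic grads are cleared by the next zero_grad).
    new_act, logp = policy(obs)
    q_min = torch.min(q1(obs, new_act), q2(obs, new_act)).squeeze(-1)
    pi_loss = (alpha * logp - q_min).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Polyak averaging of the target critics.
    with torch.no_grad():
        for p, p_targ in zip(list(q1.parameters()) + list(q2.parameters()),
                             list(q1_targ.parameters()) + list(q2_targ.parameters())):
            p_targ.mul_(1.0 - tau).add_(tau * p)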

Notes