Maximum‑entropy RL for stability and exploration
Maximum‑entropy RL augments the return with a policy‑entropy bonus, maximizing E[ Σ_t r_t + α H(π_θ(·|s_t)) ]. The entropy term encourages diverse actions, improving exploration and robustness.
In practice, maintain two Q‑nets and take their minimum (to reduce value overestimation), and update a stochastic Gaussian policy by minimizing the KL surrogate E_{a~π_θ}[ α log π_θ(a|s) − min_i Q_i(s,a) ], which equals, up to a constant, the KL divergence between π_θ(·|s) and the Boltzmann distribution ∝ exp(Q(s,·)/α).
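A minimal sketch of the twin Q‑nets, assuming PyTorch; the QNet class, layer sizes, and the example dimensions are illustrative, not prescribed by the text.

import torch
import torch.nn as nn

class QNet(nn.Module):
    """One Q-network: an MLP over the concatenated (state, action)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

# Two independent critics; min(Q1, Q2) is used wherever Q appears below.
q1 = QNet(state_dim=17, action_dim=6)   # dims are placeholders
q2 = QNet(state_dim=17, action_dim=6)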
Policy gradients pass through stochastic actions using the reparameterization trick. A tanh‑squashed Gaussian ensures bounded actions; log‑prob must account for the tanh Jacobian.
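A minimal sketch of the squashed policy head, again assuming PyTorch; the SquashedGaussianHead name and clamp bounds are illustrative. rsample() draws a reparameterized sample so gradients flow through the action, and the log‑prob subtracts the tanh Jacobian term Σ log(1 − tanh(u)²).

import torch
import torch.nn as nn

class SquashedGaussianHead(nn.Module):
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.mu = nn.Linear(feature_dim, action_dim)
        self.log_std = nn.Linear(feature_dim, action_dim)

    def forward(self, features: torch.Tensor):
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-20, 2)   # keep std in a sane range
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                 # reparameterized: gradients flow through the sample
        a = torch.tanh(u)                  # squash to (-1, 1) for bounded actions
        # log pi(a|s) = log N(u) - sum log(1 - tanh(u)^2); small epsilon for numerical stability
        log_prob = dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
        return a, log_prob.sum(dim=-1)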
Automatically tune α by minimizing J(α) = E[ −α ( log π_θ(a|s) + H_target ) ]. This drives the policy entropy toward the target H_target; for continuous actions, H_target ≈ −|A| (minus the action dimensionality) is a reasonable starting point.
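A minimal sketch of this temperature update, assuming PyTorch; optimizing log α (rather than α) keeps the temperature positive, and log π(a|s) is detached so only α receives gradients. The action dimension and learning rate are placeholders.

import torch

action_dim = 6                                   # illustrative
h_target = -float(action_dim)                    # H_target ≈ -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob: torch.Tensor) -> torch.Tensor:
    # J(alpha) = E[ -alpha * (log pi(a|s) + H_target) ]
    alpha_loss = -(log_alpha.exp() * (log_prob.detach() + h_target)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().detach()              # current alpha for the other losses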
initialize Q1, Q2, target networks, policy π_θ, temperature α
for each step:
sample action a ~ π_θ(·|s), step env → (s', r)
store (s,a,r,s') in replay
sample batch from replay
# Q update
a' ~ π_θ(·|s') ; y = r + γ (min_i Q̄_i(s',a') - α log π_θ(a'|s'))   # Q̄_i: target nets; no gradient through y
minimize Σ_i (Q_i(s,a) - y)^2
# Policy update
a ~ π_θ(·|s) (reparameterized) ; minimize α log π_θ(a|s) - min_i Q_i(s,a)
# Temperature update (optional)
minimize over α:  α * ( -log π_θ(a|s) - H_target )   # treat log π_θ(a|s) as constant
update target networks by Polyak averaging
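Putting the loop body together, a minimal sketch of one update step, assuming PyTorch; policy is assumed to return (action, log_prob) as in the head above, q_opt is assumed to optimize both critics' parameters, and the batch is assumed to carry a done flag for episode termination.

import torch

GAMMA, TAU = 0.99, 0.005   # discount and Polyak coefficient (placeholders)

def sac_update(batch, policy, q1, q2, q1_targ, q2_targ, q_opt, pi_opt, alpha):
    s, a, r, s_next, done = batch   # tensors; r and done have shape [batch]

    # Q update: bootstrap from the *target* critics, with the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + GAMMA * (1.0 - done) * (q_next - alpha * logp_next)
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy update: minimize alpha * log pi(a|s) - min_i Q_i(s,a)
    # with a freshly sampled (reparameterized) action.
    a_pi, logp_pi = policy(s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    pi_loss = (alpha * logp_pi - q_pi).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Target networks: Polyak averaging toward the live critics.
    with torch.no_grad():
        for targ, live in ((q1_targ, q1), (q2_targ, q2)):
            for p_t, p in zip(targ.parameters(), live.parameters()):
                p_t.mul_(1 - TAU).add_(TAU * p)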