Actor-Critic Method

A combination of a value network and a policy network.

Value network (critic): a neural network \(q(s, a; 𝐰)\) approximates \(Q_π(s, a)\);

Policy network (actor): a neural network \(π(a ∣ s; 𝛉)\) approximates \(π(a ∣ s)\).

Hence, for the state-value function:

\[ V_π(s) = ∑_a π(a ∣ s) ⋅ Q_π(s, a) ≈ ∑_a π(a ∣ s; 𝛉) ⋅ q(s, a; 𝐰) = V(s; 𝛉, 𝐰) \]
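As a small numeric illustration of the approximation above, the following sketch evaluates the policy-weighted sum for a single state with three actions. The logits and q-values are made-up numbers standing in for the outputs of the two networks, and `softmax` and `state_value` are hypothetical helper names chosen for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def state_value(policy_probs, q_values):
    """V(s) = sum over a of pi(a | s) * q(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# Assumed (illustrative) network outputs for one state with 3 actions.
logits = [0.0, 0.0, 0.0]    # actor output before softmax: a uniform policy
q_values = [1.0, 2.0, 3.0]  # critic output q(s, a; w) for each action

probs = softmax(logits)
v = state_value(probs, q_values)  # close to 2.0, the plain average of the q-values
```

With a uniform policy the state value is just the mean of the action values; shifting probability toward the action with the largest q raises \(V\), which is what the actor update tries to achieve.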

Training \(V(s; 𝛉, 𝐰)\) (updating the parameters \(𝛉\) and \(𝐰\)):

  • Observe state \(s_t\);

  • Randomly sample action \(a_t\) from \(π(⋅ ∣ s_t; 𝛉_t)\);

  • Perform \(a_t\) and receive the new state \(s_{t + 1}\) and reward \(r_t\);

  • Update \(𝐰\) (value network) with TD:

      • Compute \(q(s_t, a_t; 𝐰_t)\) and \(q(s_{t + 1}, a_{t + 1}; 𝐰_t)\), where \(a_{t + 1}\) is sampled from \(π(⋅ ∣ s_{t + 1}; 𝛉_t)\) but not performed;

      • TD target: \(y_t = r_t + γ ⋅ q(s_{t + 1}, a_{t + 1}; 𝐰_t)\);

      • Loss: \(L(𝐰) = \frac{1}{2}[q(s_t, a_t; 𝐰) - y_t] ^ 2\);

      • Gradient descent: \(𝐰_{t + 1} = 𝐰_t - α ⋅ \frac{𝜕 L(𝐰)}{𝜕 𝐰}∣_{𝐰 = 𝐰_t}\);

  • Update \(𝛉\) (policy network) with the policy gradient:

      • Let \(g(a, 𝛉) = \frac{𝜕 \log π(a ∣ s_t; 𝛉)}{𝜕 𝛉} ⋅ q(s_t, a; 𝐰_t)\);

      • \(\frac{𝜕 V(s_t; 𝛉, 𝐰_t)}{𝜕 𝛉} = 𝔼_A [g(A, 𝛉)]\);

      • Random sampling: \(a ∼ π(⋅ ∣ s_t; 𝛉_t)\), so \(g(a, 𝛉_t)\) is an unbiased estimate of the policy gradient;

      • Stochastic gradient ascent: \(𝛉_{t + 1} = 𝛉_t + β ⋅ g(a, 𝛉_t)\).
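
The critic's gradient descent step can be spelled out with the chain rule. This is a sketch under the assumption that the TD target \(y_t\) is treated as a constant when differentiating, which is the convention the algorithm summary relies on in its steps 5 to 7:

\[ \frac{𝜕 L(𝐰)}{𝜕 𝐰}∣_{𝐰 = 𝐰_t} = [q(s_t, a_t; 𝐰_t) - y_t] ⋅ \frac{𝜕 q(s_t, a_t; 𝐰)}{𝜕 𝐰}∣_{𝐰 = 𝐰_t} \]

Substituting into the descent rule gives

\[ 𝐰_{t + 1} = 𝐰_t - α ⋅ [q(s_t, a_t; 𝐰_t) - y_t] ⋅ \frac{𝜕 q(s_t, a_t; 𝐰)}{𝜕 𝐰}∣_{𝐰 = 𝐰_t} \]

that is, the step is the TD error times the gradient of the value network, scaled by the learning rate \(α\).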

flow chart:         Action a
   ┌──────────────────┬──────────────────────┐
   │                  v                      │
Policy    Value q   Value     Reward r       v
Network <────────── Network <────────── Environment
(Actor)             (Critic)                 │
   │                  ^                      │
   └──────────────────┴──────────────────────┘
                    State s

Algorithm summary:

  1. Observe state \(s_t\) and randomly sample \(a_t \sim \pi(\cdot \mid s_t; 𝛉_t)\).

  2. Perform \(a_t\); then environment gives new state \(s_{t+1}\) and reward \(r_t\).

  3. Randomly sample \(\tilde{a}_{t+1} ∼ \pi(\cdot \mid s_{t+1}; 𝛉_t)\). (Do not perform \(\tilde{a}_{t+1}\)!)

  4. Evaluate value network: \(q_t = q(s_t, a_t; 𝐰_t)\) and \(q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; 𝐰_t)\).

  5. Compute TD error: \(\delta_t = q_t - (r_t + γ \cdot q_{t+1})\).

  6. Differentiate value network: \(d_{w,t} = \left.\frac{𝜕 q(s_t, a_t; 𝐰)}{\partial 𝐰}\right|_{𝐰 = 𝐰_t}\).

  7. Update value network: \(𝐰_{t+1} = 𝐰_t - α ⋅ δ_t \cdot d_{w,t}\).

  8. Differentiate policy network: \(d_{θ,t} = \left.\frac{\partial \log \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}\right|_{𝛉 = 𝛉_t}\).

  9. Update policy network: \(𝛉_{t+1} = 𝛉_t + \beta \cdot q_t \cdot d_{θ,t}\).
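
The nine steps above can be sketched in plain Python. To stay self-contained, this sketch replaces both neural networks with lookup tables: `theta[s]` holds softmax logits for the actor and `w[s][a]` holds the critic's value, so the derivative in step 6 is 1 at the visited entry and the derivative in step 8 is the usual softmax score, `1{b == a} - pi(b | s)`. The two-state environment `toy_env_step`, the helper names, and the step sizes are all assumptions made up for illustration; the loop is untuned and is only meant to show the order of operations.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(theta, s, rng):
    """Steps 1 and 3: draw an action from pi(. | s; theta)."""
    probs = softmax(theta[s])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def toy_env_step(s, a):
    """Hypothetical environment: action 1 pays reward 1, states alternate."""
    return 1 - s, (1.0 if a == 1 else 0.0)

def actor_critic_update(theta, w, s, a, r, s_next, a_next,
                        alpha=0.1, beta=0.1, gamma=0.9):
    """Steps 4-9 for one transition; updates theta and w in place."""
    probs = softmax(theta[s])           # pi(. | s_t; theta_t), before any update
    q_t = w[s][a]                       # step 4: q(s_t, a_t; w_t)
    q_next = w[s_next][a_next]          # step 4: q(s_{t+1}, a~_{t+1}; w_t)
    delta = q_t - (r + gamma * q_next)  # step 5: TD error
    w[s][a] -= alpha * delta            # steps 6-7: table gradient is 1 at (s, a)
    for b in range(len(probs)):
        # Step 8: d log pi(a | s) / d theta[s][b] for softmax logits.
        d_theta = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * q_t * d_theta  # step 9: gradient ascent
    return delta

rng = random.Random(0)
theta = [[0.0, 0.0], [0.0, 0.0]]  # actor logits, one row per state
w = [[0.0, 0.0], [0.0, 0.0]]      # critic table, one row per state
s = 0
for _ in range(200):
    a = sample_action(theta, s, rng)            # step 1
    s_next, r = toy_env_step(s, a)              # step 2
    a_next = sample_action(theta, s_next, rng)  # step 3: sampled, never performed
    actor_critic_update(theta, w, s, a, r, s_next, a_next)
    s = s_next
```

Note that `a_next` is used only inside the critic's TD target; the next iteration draws a fresh action for the new state, matching the remark in step 3. Swapping `q_t` for `delta` in the step 9 line would give the baseline variant discussed next.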

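Some papers and books replace \(q_t\) with \(δ_t\) in step 9, \(𝛉_{t+1} = 𝛉_t + \beta \cdot q_t \cdot d_{θ,t}\). Both are correct: using \(q_t\) is the standard algorithm, while using \(δ_t\) is called policy gradient with baseline (the baseline is \(r_t + γ ⋅ q_{t + 1}\)). The two policy gradients have exactly the same expectation, but \(δ_t\) works better because the baseline reduces the variance, so the algorithm converges faster.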
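
Why a baseline leaves the expected gradient unchanged can be checked in one line. The derivation below is a sketch under the assumption that the baseline \(b\) does not depend on the sampled action \(A\):

\[ 𝔼_{A ∼ π(⋅ ∣ s; 𝛉)} \left[ \frac{𝜕 \log π(A ∣ s; 𝛉)}{𝜕 𝛉} ⋅ b \right] = b ⋅ ∑_a π(a ∣ s; 𝛉) ⋅ \frac{𝜕 \log π(a ∣ s; 𝛉)}{𝜕 𝛉} = b ⋅ \frac{𝜕}{𝜕 𝛉} ∑_a π(a ∣ s; 𝛉) = b ⋅ \frac{𝜕 1}{𝜕 𝛉} = 0 \]

The middle step uses \(π ⋅ \frac{𝜕 \log π}{𝜕 𝛉} = \frac{𝜕 π}{𝜕 𝛉}\). Subtracting such a \(b\) from the value term therefore shifts every sampled gradient without changing its mean, which is why only the variance is affected.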

Written on 2026-4-4