REINFORCE with Baseline

Approximations to Policy Gradient

Start from the policy gradient with a baseline:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} = \mathbb{E}_{A_t \sim \pi} \left[ \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( Q_\pi(s_t, A_t) - V_\pi(s_t) \right) \right]. \]

The terms \(Q_\pi(s_t, A_t)\) and \(V_\pi(s_t)\) in it are approximated as follows:

  • Recall that \(Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t]\).

  • Monte Carlo approximation (REINFORCE): \(Q_\pi(s_t, a_t) \approx u_t\).

  • Observe the trajectory: \(s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, \cdots, s_n, a_n, r_n\).

  • Compute return:

\[ u_t = \sum_{i=t}^{n} \gamma^{\,i-t} \cdot r_i. \]
  • \(u_t\) is an unbiased estimate of \(Q_\pi(s_t, a_t)\).

  • Approximate \(V_\pi(s_t)\) by the value network, \(v(s_t; \mathbf{w})\).
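
The return \(u_t\) defined above can be computed for every step of a finished trajectory in one backward pass, using the identity \(u_t = r_t + \gamma \cdot u_{t+1}\). The following is a minimal sketch; the function name `discounted_returns` and the sample rewards are illustrative assumptions, not part of the original notes.

```python
def discounted_returns(rewards, gamma):
    """Return [u_1, ..., u_n] with u_t = sum_{i=t}^{n} gamma^(i-t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that u_t = r_t + gamma * u_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]          # made-up rewards r_1, r_2, r_3
returns = discounted_returns(rewards, gamma=0.5)
print(returns)                     # [1.5, 1.0, 2.0]
```

The backward pass costs \(O(n)\) for the whole trajectory, whereas evaluating the sum separately for each \(t\) would cost \(O(n^2)\).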

The resulting approximate policy gradient is:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} \approx 𝐠(a_t) \approx \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]

Three approximations are made here:

  1. Approximate expectation using one sample, \(a_t\). (Monte Carlo.)

  2. Approximate \(Q_\pi(s_t, a_t)\) by \(u_t\). (Another Monte Carlo.)

  3. Approximate \(V_\pi(s)\) by the value network, \(v(s; \mathbf{w})\).

Policy and Value Networks

Computing \(𝐠(a_t)\) above requires two networks, a policy network and a value network. The policy network controls the agent; the value network serves as a baseline that helps train the policy network.

\[ \text{state } s \overset{\mathrm{Conv}}{\to} \text{feature vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{\to} & \underset{n \text{ is the number of actions}}{n \times 1 \text{ vector}} \overset{\mathrm{Softmax}}{\to} \left\{\begin{matrix} \text{action 1, } p_1 \\ \text{action 2, } p_2 \\ \cdots \\ \text{action } n \text{, } p_n \end{matrix}\right\} \pi(a \mid s; 𝛉) \\ \overset{\mathrm{Dense2}}{\to} & v(s; \mathbf{w}) \end{aligned} \right. \]
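
The two heads in the diagram can be sketched in plain Python. This is only an illustration under simplifying assumptions: the Conv stage is replaced by a hand-written feature vector, each Dense layer is a bias-free linear map, and the names `policy_head` and `value_head` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_head(features, theta):
    """Dense1 + Softmax: theta holds one weight row per action."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(logits)


def value_head(features, w):
    """Dense2: a single linear unit producing the scalar v(s; w)."""
    return sum(wi * xi for wi, xi in zip(w, features))


rng = random.Random(0)
features = [0.5, -1.0, 2.0]   # stand-in for the Conv output (assumption)
theta = [[rng.uniform(-0.1, 0.1) for _ in features] for _ in range(4)]
w = [0.0 for _ in features]

probs = policy_head(features, theta)   # pi(. | s; theta), one entry per action
value = value_head(features, w)        # v(s; w), a scalar
```

Both heads read the same feature vector, which mirrors the shared Conv stage in the diagram; only the final layers differ.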

REINFORCE with Baseline

Updating the policy network

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} \approx \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]
  • Update policy network by policy gradient ascent:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]

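Write \(u_t - v(s_t; \mathbf{w})\) as \(-\delta_t\), where \(\delta_t = v(s_t; \mathbf{w}) - u_t\) is the difference between the value network's prediction and the observed return \(u_t\).

In terms of \(\delta_t\), the policy gradient ascent update becomes:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]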
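
To make the policy update concrete, here is a sketch for a linear softmax policy, which is an assumption made for illustration (the notes use a neural network). For such a policy, the gradient of \(\ln \pi(a_t \mid s_t; 𝛉)\) with respect to the weight row of action \(k\) is \((\mathbb{1}[k = a_t] - p_k)\) times the feature vector. The helper names are hypothetical.

```python
import math


def action_probs(theta, features):
    """Softmax policy over linear logits, one weight row per action."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_update(theta, features, action, delta, beta):
    """One step of theta <- theta - beta * delta * d ln pi(a|s) / d theta."""
    probs = action_probs(theta, features)
    new_theta = []
    for k, row in enumerate(theta):
        indicator = 1.0 if k == action else 0.0
        # Gradient of ln pi(action | s) with respect to row k.
        grad_row = [(indicator - probs[k]) * x for x in features]
        new_theta.append([wk - beta * delta * g for wk, g in zip(row, grad_row)])
    return new_theta


features = [1.0, 2.0]
theta = [[0.0, 0.0], [0.0, 0.0]]
# delta = v - u_t = -1 means the observed return exceeded the baseline.
new_theta = policy_update(theta, features, action=0, delta=-1.0, beta=0.1)
before = action_probs(theta, features)[0]
after = action_probs(new_theta, features)[0]
```

With a negative \(\delta_t\) the chosen action turned out better than the baseline predicted, so its probability goes up (`after > before`); with a positive \(\delta_t\) it goes down.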

Updating the value network

  • Recall \(v(s_t; \mathbf{w})\) is an approximation to \(V_\pi(s_t) = \mathbb{E}[U_t \mid s_t]\).

  • Prediction error: \(\delta_t = v(s_t; \mathbf{w}) - u_t\).

  • Gradient:

\[ \frac{\partial \, \frac{\delta_t^2}{2}}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]
  • Gradient descent:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]
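
The value update can be sketched the same way. The example assumes a linear value function \(v(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}\), so that \(\partial v / \partial \mathbf{w}\) is simply the feature vector \(\mathbf{x}\); this is an illustrative simplification, and `value_update` is a hypothetical name.

```python
def value_update(w, features, u_t, alpha):
    """One step of w <- w - alpha * delta * dv/dw for a linear v(s; w)."""
    v = sum(wi * xi for wi, xi in zip(w, features))
    delta = v - u_t                      # prediction error
    # For a linear value function, dv/dw equals the feature vector.
    new_w = [wi - alpha * delta * xi for wi, xi in zip(w, features)]
    return new_w, delta


features = [1.0, 2.0]
w = [0.0, 0.0]
u_t = 1.0                                # made-up observed return
new_w, delta = value_update(w, features, u_t, alpha=0.1)

old_v = sum(wi * xi for wi, xi in zip(w, features))
new_v = sum(wi * xi for wi, xi in zip(new_w, features))
```

After the step, the prediction moves from `old_v` toward the observed return, which is the regression behavior the squared-error loss \(\delta_t^2 / 2\) asks for.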

Summary of Algorithm

  • Play a game to the end and observe the trajectory:

\(s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n\).

  • Compute

\[ u_t = \sum_{i=t}^{n} \gamma^{\,i-t} \cdot r_i \quad \text{and} \quad \delta_t = v(s_t; \mathbf{w}) - u_t. \]
  • Update the policy network by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]
  • Update the value network by:

\[ \mathbf{w} \leftarrow \mathbf{w}- \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Repeat this procedure for every \(t\) from \(t = 1\) to \(t = n\), for a total of \(n\) rounds of updates.
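
The whole procedure can be put together as a short training loop. The sketch below rests on several assumptions that are not in the notes: a hypothetical toy game with a single constant state in which action 1 pays reward 1 and action 0 pays 0, a linear softmax policy, a linear value function, and arbitrary hyperparameters.

```python
import math
import random


def action_probs(theta, x):
    """Softmax policy over linear logits, one weight row per action."""
    logits = [sum(wk * xi for wk, xi in zip(row, x)) for row in theta]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value(w, x):
    """Linear value function v(s; w)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def play_episode(theta, rng, length=5):
    """Hypothetical toy game: constant state; action 1 pays 1, action 0 pays 0."""
    x = [1.0]
    states, actions, rewards = [], [], []
    for _ in range(length):
        probs = action_probs(theta, x)
        a = rng.choices(range(len(probs)), weights=probs)[0]
        states.append(x)
        actions.append(a)
        rewards.append(1.0 if a == 1 else 0.0)
    return states, actions, rewards


def train(episodes=500, gamma=0.9, beta=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0], [0.0]]   # policy parameters, one row per action
    w = [0.0]                # value parameters
    for _ in range(episodes):
        states, actions, rewards = play_episode(theta, rng)
        # Returns u_t, computed backwards: u_t = r_t + gamma * u_{t+1}.
        returns, running = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        # One policy update and one value update per time step.
        for x, a, u in zip(states, actions, returns):
            delta = value(w, x) - u
            probs = action_probs(theta, x)
            for k, row in enumerate(theta):
                coeff = (1.0 if k == a else 0.0) - probs[k]
                for j, xj in enumerate(x):
                    row[j] -= beta * delta * coeff * xj
            for j, xj in enumerate(x):
                w[j] -= alpha * delta * xj
    return theta, w


theta, w = train()
final_probs = action_probs(theta, [1.0])
```

In this toy setting the policy should come to favor action 1, since it is the only action that earns reward. Because the state never changes, the baseline here is a single learned number, which is a much cruder baseline than a value network that sees real states.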

Written on 2026-4-13