REINFORCE with Baseline
Approximations to Policy Gradient
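Consider the policy gradient with baseline:
\[\frac{\partial V_\pi(s_t)}{\partial \boldsymbol{\theta}} = \mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi(A_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \big(Q_\pi(s_t, A_t) - V_\pi(s_t)\big)\right].\]
We approximate the terms \(Q_\pi(s_t, A_t)\) and \(V_\pi(s_t)\) in it as follows: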
Recall that \(Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t]\).
Monte Carlo approximation \(Q_\pi(s_t, a_t) \approx u_t\) (REINFORCE):
Observe the trajectory: \(s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, \cdots, s_n, a_n, r_n\).
Compute the return \(u_t = \sum_{i=t}^{n} \gamma^{i-t} \cdot r_i\), where \(\gamma\) is the discount factor.
\(u_t\) is an unbiased estimate of \(Q_\pi(s_t, a_t)\).
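The return computation can be sketched in a few lines. This is a minimal illustration, assuming the usual discounted return \(u_t = \sum_{i=t}^{n} \gamma^{i-t} r_i\); the function name `discounted_returns` and the parameter `gamma` are illustrative names, not taken from the text.

```python
def discounted_returns(rewards, gamma):
    """Return [u_1, ..., u_n] with u_t = sum_{i >= t} gamma**(i - t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each u_t reuses u_{t+1}: u_t = r_t + gamma * u_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1 each and gamma = 0.5.
print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

The backward sweep computes all \(n\) returns in one pass instead of re-summing the tail of the trajectory for every \(t\).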
Approximate \(V_\pi(s_t)\) by the value network, \(v(s_t; \mathbf{w})\).
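The resulting approximate policy gradient is:
\[\frac{\partial V_\pi(s_t)}{\partial \boldsymbol{\theta}} \approx \mathbf{g}(a_t) \approx \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \big(u_t - v(s_t; \mathbf{w})\big).\]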
Three approximations are made here:
Approximate expectation using one sample, \(a_t\). (Monte Carlo.)
Approximate \(Q_\pi(s_t, a_t)\) by \(u_t\). (Another Monte Carlo.)
Approximate \(V_\pi(s)\) by the value network, \(v(s; \mathbf{w})\).
Policy and Value Networks
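The \(\mathbf{g}(a_t)\) above requires two networks: a policy network and a value network. The policy network controls the agent, and the value network serves as a baseline that helps train the policy network.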
REINFORCE with Baseline
Updating the policy network
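Update the policy network by policy gradient ascent, where \(\beta\) is the learning rate:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \big(u_t - v(s_t; \mathbf{w})\big).\]
Denote \(u_t - v(s_t; \mathbf{w})\) by \(-\delta_t\); here \(\delta_t\) is the difference between the value network's prediction and the actually observed return \(u_t\).
With this notation, the policy gradient ascent update becomes:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.\]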
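A minimal sketch of one policy update step may help make the sign convention concrete. It assumes a tabular softmax policy with logits `theta[s][a]` (a hypothetical parameterization chosen for illustration, not the network from the text), a learning rate `beta`, and the prediction error \(\delta_t = v(s_t; \mathbf{w}) - u_t\). For a softmax policy, the gradient of \(\ln \pi(a \mid s)\) with respect to the logit of action \(b\) is \(\mathbb{1}[b = a] - \pi(b \mid s)\).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_update(theta, s, a, delta, beta):
    """Apply theta <- theta - beta * delta * d ln pi(a|s) / d theta in place.

    theta[s][b] is the logit of action b in state s (assumed tabular policy).
    """
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of ln pi(a|s) w.r.t. theta[s][b] for a softmax policy.
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] -= beta * delta * grad_log_pi
    return theta


# delta < 0 means the observed return beat the baseline,
# so the chosen action (a = 0) becomes more likely.
theta = policy_update([[0.0, 0.0]], s=0, a=0, delta=-1.0, beta=0.1)
print(softmax(theta[0]))
```

Because the step subtracts `beta * delta * grad_log_pi`, a negative \(\delta_t\) (return above the baseline) raises the probability of the action taken, and a positive \(\delta_t\) lowers it.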
Updating the value network
Recall \(v(s_t; \mathbf{w})\) is an approximation to \(V_\pi(s_t) = \mathbb{E}[U_t \mid s_t]\).
Prediction error: \(\delta_t = v(s_t; \mathbf{w}) - u_t\).
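Gradient of the squared error \(\delta_t^2 / 2\):
\[\frac{\partial\, \delta_t^2 / 2}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}.\]
Gradient descent, where \(\alpha\) is the learning rate:
\[\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}.\]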
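The value update can be illustrated with a linear value function. This sketch assumes \(v(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)\) for a feature vector \(\mathbf{x}(s)\), the squared loss \(\delta_t^2 / 2\), and a learning rate `alpha`; these are illustrative choices, and a real value network would obtain the gradient by backpropagation instead.

```python
def value(w, features):
    """Linear value estimate v(s; w) = w . x(s) (assumed model)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def value_update(w, features, u_t, alpha):
    """One gradient descent step on delta**2 / 2, returning (new_w, delta)."""
    delta = value(w, features) - u_t  # prediction error
    # For a linear model, dv/dw equals the feature vector.
    new_w = [wi - alpha * delta * xi for wi, xi in zip(w, features)]
    return new_w, delta


w, delta = value_update([0.0, 0.0], [1.0, 2.0], u_t=1.0, alpha=0.1)
print(w, delta)
```

After the step, the prediction for the same state moves toward the observed return \(u_t\), which is the purpose of regressing \(v(s_t; \mathbf{w})\) on \(u_t\).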
Summary of Algorithm
Play a game to the end and observe the trajectory:
\(s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n\).
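Compute the return \(u_t = \sum_{i=t}^{n} \gamma^{i-t} \cdot r_i\) and the prediction error \(\delta_t = v(s_t; \mathbf{w}) - u_t\).
Update the policy network by:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.\]
Update the value network by:
\[\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}.\]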
Repeat this procedure for every \(t\) from \(t = 1\) to \(t = n\), for a total of \(n\) rounds of updates.
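The whole procedure for one finished episode can be sketched as follows. This is a toy illustration under stated assumptions, not the networks from the text: the policy is a tabular softmax with logits `theta[s][a]`, the value function is a table `w[s]`, and `gamma`, `beta`, `alpha` are hypothetical hyperparameter names for the discount factor and the two learning rates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_on_episode(theta, w, trajectory, gamma=0.9, beta=0.1, alpha=0.1):
    """REINFORCE with baseline on one episode (tabular sketch).

    theta[s][a]: policy logits, w[s]: value estimate,
    trajectory: list of (s_t, a_t, r_t) tuples observed until the game ends.
    """
    # Compute all returns u_t with a backward sweep.
    returns, running = [0.0] * len(trajectory), 0.0
    for t in reversed(range(len(trajectory))):
        running = trajectory[t][2] + gamma * running
        returns[t] = running

    # One round of updates per time step t = 1, ..., n.
    for (s, a, _), u_t in zip(trajectory, returns):
        delta = w[s] - u_t  # prediction error of the baseline
        probs = softmax(theta[s])
        for b in range(len(theta[s])):
            grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] -= beta * delta * grad_log_pi  # policy update
        w[s] -= alpha * delta  # value update (dv/dw[s] = 1 for a table)
    return theta, w


# One state, two actions; the agent took action 1 and received reward 1.
theta, w = train_on_episode([[0.0, 0.0]], [0.0], [(0, 1, 1.0)])
print(theta, w)
```

In this sketch \(\delta_t\) is computed before either update at step \(t\), so both the policy step and the value step use the same prediction error.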
Written on 2026-4-13