Adding a baseline to the policy gradient reduces variance and speeds up convergence.
Baseline
\[
\begin{aligned}
\mathbb{E}_{A \sim \pi} \left[
b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉}
\right]
&= b \cdot \mathbb{E}_{A \sim \pi} \left[
\frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉}
\right] \\
&= b \cdot \sum_a \pi(a \mid s; 𝛉)\,
\frac{\partial \ln \pi(a \mid s; 𝛉)}{\partial 𝛉} \\
&= b \cdot \sum_a \pi(a \mid s; 𝛉)\,
\left[ \frac{1}{\pi(a \mid s; 𝛉)} \cdot \frac{\partial \pi(a \mid s; 𝛉)}{\partial 𝛉} \right] \\
&= b \cdot \sum_a \frac{\partial \pi(a \mid s; 𝛉)}{\partial 𝛉} \\
&= b \cdot \frac{\partial \sum_a \pi(a \mid s; 𝛉)}{\partial 𝛉}
\end{aligned}
\]
Since \(\pi(\cdot \mid s; 𝛉)\) is a probability mass function over actions, \(\sum_a \pi(a \mid s; 𝛉) = 1\), and therefore:
\[
\begin{aligned}
\mathbb{E}_{A \sim \pi} \left[
b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉}
\right]
&= b \cdot \frac{\partial 1}{\partial 𝛉} \\
&= 0
\end{aligned}
\]
This gives the following property: if \(b\) is independent of \(A\), then
\[
\mathbb{E}_{A \sim \pi} \left[
b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉}
\right] = 0
\]
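The property can be checked numerically. The following is a minimal sketch, assuming a softmax policy over three actions in a single state with the logits `theta` as parameters. The softmax form, the function names, and the numeric values are illustrative assumptions and are not part of the derivation above.

```python
import numpy as np

def softmax(theta):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """Gradient of ln pi(a | s; theta) w.r.t. theta for a softmax policy."""
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - softmax(theta)

def expected_baseline_term(theta, b):
    """Exact value of E_{A ~ pi}[ b * grad ln pi(A | s; theta) ]."""
    pi = softmax(theta)
    return sum(pi[a] * b * grad_log_pi(theta, a) for a in range(len(theta)))

theta = np.array([0.5, -1.0, 2.0])   # hypothetical logits
term = expected_baseline_term(theta, b=3.7)
print(np.allclose(term, 0.0))        # prints True
```

Any constant `b` gives the same result, since the sum over actions of \(\pi \cdot \partial \ln \pi / \partial 𝛉\) is already zero.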
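Because this expectation is zero, subtracting it from the policy gradient leaves the gradient unchanged:
\[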
\begin{aligned}
\frac{\partial V_\pi(s)}{\partial 𝛉}
&= \mathbb{E}_{A \sim \pi} \left[
\frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, A)
\right] - \mathbb{E}_{A \sim \pi} \left[
\frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot b
\right] \\
&= \mathbb{E}_{A \sim \pi} \left[
\frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot \bigl(Q_\pi(s, A) - b\bigr)
\right].
\end{aligned}
\]
Theorem. If \(b\) is independent of \(A_t\), then the policy gradient is:
\[
\frac{\partial V_\pi(s_t)}{\partial 𝛉}
=
\mathbb{E}_{A_t \sim \pi} \left[
\frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉}
\cdot \bigl(Q_\pi(s_t, A_t) - b\bigr)
\right].
\]
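Any value of \(b\) leaves the expectation unchanged. The reason for including a \(b\) that makes no difference to the expectation is that the algorithm does not use this formula directly; it uses a Monte Carlo approximation of it. The expectation does not depend on \(b\), but the Monte Carlo approximation does. If \(b\) is chosen well, that is, close to \(Q_\pi\), it lowers the variance of the Monte Carlo approximation and the algorithm converges faster.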
Monte Carlo approximation
Let
\[
𝐠(A_t) = \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉}
\cdot \bigl(Q_\pi(s_t, A_t) - b\bigr)
\]
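(\(𝐠(a_t)\), evaluated at an action \(a_t\) sampled from \(\pi(\cdot \mid s_t; 𝛉)\), is called the stochastic policy gradient.) It is an unbiased estimate of the policy gradient: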
\[
\mathbb{E}_{A_t \sim \pi} \left[ 𝐠(A_t) \right]
=
\frac{\partial V_\pi(s_t)}{\partial 𝛉}.
\]
Given a sampled action \(a_t\), the policy parameters are updated by stochastic gradient ascent with step size \(\beta\):
\[
𝛉 \leftarrow 𝛉 + \beta \cdot 𝐠(a_t).
\]
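The update can be sketched in code. This is a minimal illustration, assuming a softmax policy over three actions in a single state and a known table `q` of action values; in practice \(Q_\pi\) is unknown and must itself be approximated. The names `ascent_step`, `q`, and all numeric values are hypothetical.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def ascent_step(theta, q, b, beta, rng):
    """One update theta <- theta + beta * g(a_t), with a_t sampled from pi."""
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)   # sample a_t ~ pi(. | s_t; theta)
    grad_log_pi = -pi                  # new array holding -pi
    grad_log_pi[a] += 1.0              # one_hot(a_t) - pi for a softmax policy
    g = grad_log_pi * (q[a] - b)       # stochastic policy gradient g(a_t)
    return theta + beta * g, a

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0])     # hypothetical logits
q = np.array([10.0, 11.0, 12.0])       # hypothetical Q_pi(s_t, a)
b = softmax(theta) @ q                 # baseline chosen here as V_pi(s_t)
new_theta, a_t = ascent_step(theta, q, b, beta=0.1, rng=rng)
print(a_t, new_theta)
```

Only the sampled action enters the update, which is why the choice of `b` changes each individual step even though it does not change the expected step.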
Whatever \(b\) (independent of \(A_t\)) we use, the policy gradient \(\mathbb{E}_{A_t \sim \pi}[𝐠(A_t)]\) remains the same.
However, \(b\) affects \(𝐠(a_t)\).
A good \(b\) leads to small variance and speeds up convergence.
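The effect on variance can be illustrated numerically. The sketch below assumes a softmax policy over three actions in a single state and a hypothetical table `q` of action values with a large common offset. It computes the exact mean and total variance of \(𝐠(A_t)\) over \(A_t \sim \pi\) rather than sampling, so the comparison is deterministic. The function name `estimator_stats` and all numbers are assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def estimator_stats(theta, q, b):
    """Exact mean and total variance of g(A) = grad ln pi(A) * (Q(A) - b), A ~ pi."""
    pi = softmax(theta)
    k = len(theta)
    # Row a holds g(a); for a softmax policy grad ln pi(a) = one_hot(a) - pi.
    g = (np.eye(k) - pi) * (q - b)[:, None]
    mean = pi @ g
    # Total variance: expected squared distance of g(A) from its mean.
    var = pi @ ((g - mean) ** 2).sum(axis=1)
    return mean, var

theta = np.array([0.5, -1.0, 2.0])    # hypothetical logits
q = np.array([10.0, 11.0, 12.0])      # hypothetical Q_pi(s_t, a)
v = softmax(theta) @ q                # V_pi(s_t) = E_A[Q_pi(s_t, A)]

mean_0, var_0 = estimator_stats(theta, q, b=0.0)
mean_v, var_v = estimator_stats(theta, q, b=v)
print(np.allclose(mean_0, mean_v))    # same expectation for both baselines
print(var_0, var_v)                   # variance is smaller with b = v here
```

In this example the action values are all near 11, so subtracting \(b = V_\pi(s_t)\) shrinks the factor \(Q_\pi - b\) from about 10 to at most 2 in magnitude. The improvement depends on the values chosen and is not a general guarantee for every possible \(Q_\pi\).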
Choice of baseline
\(b = 0\) recovers the policy gradient without a baseline:
\[
\frac{\partial V_\pi(s_t)}{\partial 𝛉}
= \mathbb{E}_{A_t \sim \pi} \left[
\frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉}
\cdot Q_\pi(s_t, A_t)
\right].
\]
\(b = V_\pi(s_t)\)
Because \(s_t\) has been observed, \(b = V_\pi(s_t)\) is independent of \(A_t\).
Why use such a baseline?
\(V_\pi(s_t)\) is close to \(Q_\pi(s_t, A_t)\):
\[
V_\pi(s_t) = \mathbb{E}_{A_t}\left[ Q_\pi(s_t, A_t) \right].
\]
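As a worked step, substituting \(b = V_\pi(s_t)\) into the theorem gives the form below. The difference \(Q_\pi(s_t, A_t) - V_\pi(s_t)\) is commonly called the advantage function; that name is not used elsewhere in these notes and is mentioned only for orientation.
\[
\frac{\partial V_\pi(s_t)}{\partial 𝛉}
=
\mathbb{E}_{A_t \sim \pi} \left[
\frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉}
\cdot \bigl(Q_\pi(s_t, A_t) - V_\pi(s_t)\bigr)
\right].
\]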
Written on 2026-4-11