Policy Gradient with Baseline

Adding a baseline to the policy gradient can reduce the variance of its Monte Carlo estimate, which speeds up convergence.

Baseline

  • Let the baseline, \(b\), be anything independent of \(A\).

\[ \begin{aligned} \mathbb{E}_{A \sim \pi} \left[ b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \right] &= b \cdot \mathbb{E}_{A \sim \pi} \left[ \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \right] \\ &= b \cdot \sum_a \pi(a \mid s; 𝛉)\, \frac{\partial \ln \pi(a \mid s; 𝛉)}{\partial 𝛉} \\ &= b \cdot \sum_a \pi(a \mid s; 𝛉)\, \left[ \frac{1}{\pi(a \mid s; 𝛉)} \cdot \frac{\partial \pi(a \mid s; 𝛉)}{\partial 𝛉} \right] \\ &= b \cdot \sum_a \frac{\partial \pi(a \mid s; 𝛉)}{\partial 𝛉} \\ &= b \cdot \frac{\partial \sum_a \pi(a \mid s; 𝛉)}{\partial 𝛉} \end{aligned} \]

Since \(\pi(\cdot \mid s; 𝛉)\) is a probability mass function over actions, \(\sum_a \pi(a \mid s; 𝛉) = 1\). Therefore:

\[ \begin{aligned} \mathbb{E}_{A \sim \pi} \left[ b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \right] &= b \cdot \frac{\partial 1}{\partial 𝛉} \\ &= 0 \end{aligned} \]

This gives the following property: if \(b\) is independent of \(A\), then

\[ \mathbb{E}_{A \sim \pi} \left[ b \cdot \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \right] = 0 \]
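  • Policy gradient: because the expectation of the baseline term is zero, subtracting it leaves the gradient unchanged: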

\[ \begin{aligned} \frac{\partial V_\pi(s)}{\partial 𝛉} &= \mathbb{E}_{A \sim \pi} \left[ \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, A) \right] - \mathbb{E}_{A \sim \pi} \left[ \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot b \right] \\ &= \mathbb{E}_{A \sim \pi} \left[ \frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot \bigl(Q_\pi(s, A) - b\bigr) \right]. \end{aligned} \]

Theorem. If \(b\) is independent of \(A_t\), then policy gradient is:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} = \mathbb{E}_{A_t \sim \pi} \left[ \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \bigl(Q_\pi(s_t, A_t) - b\bigr) \right]. \]
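
The theorem can be checked numerically on a small case. The sketch below assumes a softmax policy over three actions in a single state; the logits `theta`, the values `q`, and the constant `b` are made-up numbers for illustration, not part of the original notes. It sums over actions exactly, so no sampling is involved.

```python
import numpy as np

def softmax(theta):
    """Softmax policy over a finite action set, parameterised by logits theta."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def grad_log_pi(theta, a):
    """d ln pi(a; theta) / d theta for a softmax policy: onehot(a) - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.5, -1.0, 2.0])   # hypothetical logits
q = np.array([10.0, 11.0, 12.0])     # hypothetical Q_pi(s, a)
b = 3.7                              # any constant independent of the action
pi = softmax(theta)
actions = range(len(theta))

# E_A[b * grad ln pi(A)], computed exactly by summing over actions.
baseline_term = sum(pi[a] * b * grad_log_pi(theta, a) for a in actions)

# Policy gradient without and with the baseline.
grad_plain = sum(pi[a] * grad_log_pi(theta, a) * q[a] for a in actions)
grad_baseline = sum(pi[a] * grad_log_pi(theta, a) * (q[a] - b) for a in actions)

print(baseline_term)                              # zero up to rounding error
print(np.allclose(grad_plain, grad_baseline))
```

Changing `b` to any other constant should leave `grad_baseline` unchanged, which is the content of the theorem.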

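Whatever value \(b\) takes, as long as it is independent of \(A_t\), the expectation is unaffected. The reason to include a term that changes nothing in expectation is that the algorithm does not use this formula directly; it uses a Monte Carlo approximation of it. The expectation does not depend on \(b\), but the Monte Carlo approximation does. If \(b\) is chosen well, meaning close to \(Q_\pi\), it lowers the variance of the Monte Carlo approximation and the algorithm converges faster.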
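
To make the variance claim concrete, here is a minimal sketch that computes the exact mean and total variance (trace of the covariance) of the single-sample estimate \(\frac{\partial \ln \pi(A \mid s; 𝛉)}{\partial 𝛉} \cdot (Q_\pi(s, A) - b)\) for two choices of \(b\). It assumes a softmax policy over three actions; the logits and Q values are hypothetical, and `estimator_stats` is a helper name invented for this example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def estimator_stats(theta, q, b):
    """Exact mean and total variance of g(A) = grad ln pi(A) * (Q(A) - b)."""
    pi = softmax(theta)
    n = len(theta)
    # Row a holds g(a); for softmax logits, grad ln pi(a) = onehot(a) - pi.
    g = (np.eye(n) - pi) * (q - b)[:, None]
    mean = pi @ g
    # Total variance = E[||g||^2] - ||E[g]||^2.
    total_var = pi @ np.sum(g ** 2, axis=1) - np.sum(mean ** 2)
    return mean, total_var

theta = np.array([0.5, -1.0, 2.0])   # hypothetical logits
q = np.array([10.0, 11.0, 12.0])     # hypothetical Q_pi(s, a)
v = softmax(theta) @ q               # V_pi(s) = E_A[Q_pi(s, A)]

mean_zero, var_zero = estimator_stats(theta, q, b=0.0)
mean_v, var_v = estimator_stats(theta, q, b=v)

print("same mean:", np.allclose(mean_zero, mean_v))
print("variance with b = 0:", var_zero)
print("variance with b = V:", var_v)
```

In this toy setting the Q values are all far from zero but close to each other, which is exactly the situation where a baseline near \(Q_\pi\) helps most.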

Monte Carlo approximation

\[ 𝐠(A_t) = \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \bigl(Q_\pi(s_t, A_t) - b\bigr) \]

(\(𝐠(a_t)\) is called the stochastic policy gradient.)

  • Randomly sample \(a_t \sim \pi(\cdot \mid s_t; 𝛉)\) and compute \(𝐠(a_t)\).

  • \(𝐠(a_t)\) is an unbiased estimate of the policy gradient:

\[ \mathbb{E}_{A_t \sim \pi} \left[ 𝐠(A_t) \right] = \frac{\partial V_\pi(s_t)}{\partial 𝛉}. \]
  • Stochastic policy gradient ascent:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot 𝐠(a_t). \]
  • Whatever \(b\) (independent of \(A_t\)) we use, the policy gradient \(\mathbb{E}_{A_t \sim \pi}[𝐠(A_t)]\) remains the same.

  • However, \(b\) affects \(𝐠(a_t)\).

  • A good \(b\) leads to small variance and speeds up convergence.
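
The update rule above can be sketched on a toy problem. The example assumes a single state with three actions whose values \(Q_\pi(s, a)\) are known and fixed (a bandit-like setting chosen only for illustration), a softmax policy, and the baseline \(b = \sum_a \pi(a) \cdot Q(a)\). The function `train`, the step size, and the step count are hypothetical choices; a real algorithm would have to estimate \(Q_\pi\) and \(b\).

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def train(q, beta=0.1, steps=2000, seed=0):
    """Stochastic policy gradient ascent with a baseline on a one-state toy problem."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(q))
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(len(q), p=pi)        # sample a_t ~ pi(. | s_t; theta)
        b = pi @ q                          # baseline, independent of a_t
        grad_log_pi = -pi                   # softmax: grad ln pi(a) = onehot(a) - pi
        grad_log_pi[a] += 1.0
        g = grad_log_pi * (q[a] - b)        # stochastic policy gradient g(a_t)
        theta = theta + beta * g            # theta <- theta + beta * g(a_t)
    return softmax(theta)

q = np.array([1.0, 2.0, 3.0])               # hypothetical action values
pi_final = train(q)
print(pi_final)
```

The final policy should place most of its probability on the action with the largest value. Replacing `b` with `0.0` gives the standard update for comparison.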

Choice of baseline

\(b = 0\)

  • We can simply set \(b = 0\).

  • It becomes the standard policy gradient:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} = \mathbb{E}_{A_t \sim \pi} \left[ \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s_t, A_t) \right]. \]

\(b = V_\pi(s_t)\)

  • Because \(s_t\) has been observed, \(b = V_\pi(s_t)\) is independent of \(A_t\).

  • Why use such a baseline?

  • \(V_\pi(s_t)\) is close to \(Q_\pi(s_t, A_t)\):

\[ V_\pi(s_t) = \mathbb{E}_{A_t}\left[ Q_\pi(s_t, A_t) \right]. \]
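
As a worked step (a direct substitution into the theorem, spelled out here for convenience), choosing \(b = V_\pi(s_t)\) gives:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} = \mathbb{E}_{A_t \sim \pi} \left[ \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \bigl(Q_\pi(s_t, A_t) - V_\pi(s_t)\bigr) \right]. \]

The multiplier \(Q_\pi(s_t, A_t) - V_\pi(s_t)\) has zero mean under \(\pi\), because \(V_\pi(s_t)\) is the expectation of \(Q_\pi(s_t, A_t)\). This difference is commonly called the advantage of action \(A_t\).
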
Written on 2026-4-11