Advantage Actor-Critic

Two neural networks:

  • Policy network (actor): \(\pi(a \mid s; 𝛉)\)

  • It is an approximation to the policy function, \(\pi(a \mid s)\).

  • It controls the agent.

  • Value network (critic): \(v(s; \mathbf{w})\)

  • It is an approximation to the state-value function, \(V_\pi(s)\).

  • It evaluates how good the state \(s\) is.

The overall architecture is the same as that of REINFORCE with Baseline:

\[ \text{state } s \overset{\mathrm{Conv}}{\to} \text{feature vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{\to} {} & \underset{n \text{ is the number of actions}}{n \times 1 \text{ vector}} \overset{\text{Softmax}}{\to} \left\{\begin{matrix} \text{action 1, } p_1 \\ \text{action 2, } p_2 \\ \cdots \\ \text{action } n \text{, } p_n \end{matrix}\right\} \pi(a \mid s; 𝛉) \\ \overset{\mathrm{Dense2}}{\to} {} & v(s; \mathbf{w}) \end{aligned} \right. \]
In A2C, \(v(s; \mathbf{w})\) acts as a critic, whereas in REINFORCE with Baseline it acts only as a baseline and does not evaluate how good an action is.
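
The two-headed structure above can be sketched as a forward pass in NumPy. This is only a minimal illustration under stated assumptions, not the implementation behind these notes: the Conv layers are replaced by a precomputed feature vector, both heads are single linear layers, and the names `softmax`, `forward`, `W_actor`, `w_critic` and the sizes `d`, `n` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def forward(feature, W_actor, w_critic):
    """Map a shared feature vector to (action probabilities, state value).

    feature:  shape (d,), stands in for the output of the Conv layers (assumption)
    W_actor:  shape (n, d), plays the role of Dense1 (parameters theta)
    w_critic: shape (d,), plays the role of Dense2 (parameters w)
    """
    probs = softmax(W_actor @ feature)   # pi(. | s; theta): n probabilities
    value = float(w_critic @ feature)    # v(s; w): one scalar per state
    return probs, value

rng = np.random.default_rng(0)
d, n = 4, 3                              # assumed feature size and action count
feature = rng.normal(size=d)
probs, value = forward(feature, rng.normal(size=(n, d)), rng.normal(size=d))
action = rng.choice(n, p=probs)          # the actor samples the action to take
```

Note that the value head takes only the state features as input, which is why it is written \(v(s; \mathbf{w})\) with no dependence on the action.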

The difference between the two lies in training. A2C is trained as follows:

  • Observe a transition \((s_t, a_t, r_t, s_{t+1})\).

  • TD target:

\[ y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w}). \]
  • TD error:

\[ \delta_t = v(s_t; \mathbf{w}) - y_t. \]
  • Update the policy network (actor) by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]
  • Update the value network (critic) by:

\[ \mathbf{w} \leftarrow \mathbf{w}- \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]
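
The update rules above can be traced on a small example. The sketch below assumes linear models in place of the neural networks (a softmax policy over `theta @ x` and a value `w @ x`), so that both gradients have closed forms. The function name `a2c_step`, the feature vectors, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def a2c_step(theta, w, x_t, a_t, r_t, x_next, gamma=0.99, alpha=0.01, beta=0.01):
    """One A2C update with a one-step TD target on linear models (assumed).

    theta: (n, d) actor parameters, pi(. | s) = softmax(theta @ x)
    w:     (d,)   critic parameters, v(s) = w @ x
    x_t, x_next: feature vectors of s_t and s_{t+1}
    """
    y_t = r_t + gamma * (w @ x_next)      # TD target, treated as a constant
    delta_t = (w @ x_t) - y_t             # TD error: v(s_t; w) - y_t

    probs = softmax(theta @ x_t)
    one_hot = np.zeros_like(probs)
    one_hot[a_t] = 1.0
    grad_log_pi = np.outer(one_hot - probs, x_t)  # d ln pi(a_t | s_t) / d theta
    grad_v = x_t                                  # d v(s_t; w) / d w

    theta_new = theta - beta * delta_t * grad_log_pi   # actor update
    w_new = w - alpha * delta_t * grad_v               # critic update
    return theta_new, w_new, delta_t

# Toy transition: 2 actions, 3 features, all parameters start at zero.
theta, w = np.zeros((2, 3)), np.zeros(3)
x_t = np.array([1.0, 0.0, 0.0])
x_next = np.array([0.0, 1.0, 0.0])
theta, w, delta = a2c_step(theta, w, x_t, a_t=1, r_t=1.0, x_next=x_next)
```

In this toy transition the critic predicted 0 but the target is 1, so \(\delta_t = -1\). The minus sign in both updates then raises the logit of the chosen action and moves \(v(s_t; \mathbf{w})\) up toward the target.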

A2C with Multi-Step TD Target

The A2C training procedure above uses a one-step TD target (the target contains only one reward, \(r_t\)). It can be replaced with a multi-step TD target to improve results. One-step vs. multi-step TD target:

  • Observing a transition \((s_t, a_t, r_t, s_{t+1})\).

  • One-step TD target:

\[ y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w}). \]
  • Observing \(m\) transitions:

\[ \{(s_{t+i}, a_{t+i}, r_{t+i}, s_{t+i+1})\}_{i=0}^{m-1}. \]
  • \(m\)-step TD target:

\[ y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w}). \]
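
The \(m\)-step target is straightforward to compute once the \(m\) rewards and the critic's estimate at \(s_{t+m}\) are available. The helper below is a small sketch; the name `m_step_td_target` and the example numbers are made up for illustration.

```python
def m_step_td_target(rewards, v_bootstrap, gamma):
    """Return sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v(s_{t+m}; w).

    rewards:     [r_t, ..., r_{t+m-1}], so m = len(rewards)
    v_bootstrap: the critic's estimate v(s_{t+m}; w)
    """
    y = v_bootstrap
    for r in reversed(rewards):   # work backwards: y <- r + gamma * y
        y = r + gamma * y
    return y

# m = 1 recovers the one-step target r_t + gamma * v(s_{t+1}; w).
y_one = m_step_td_target([1.0], v_bootstrap=4.0, gamma=0.5)              # 3.0
# m = 3: 1 + 0.5*2 + 0.25*3 + 0.125*4
y_three = m_step_td_target([1.0, 2.0, 3.0], v_bootstrap=4.0, gamma=0.5)  # 3.25
```

The backward loop avoids computing powers of \(\gamma\) explicitly and gives the same value as the summation formula.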

With a multi-step TD target, the algorithm becomes:

  • Observe a trajectory of \(m\) transitions, from time \(t\) to \(t + m - 1\) (ending in state \(s_{t+m}\)).

  • TD target:

\[ y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w}). \]
  • TD error:

\[ \delta_t = v(s_t; \mathbf{w}) - y_t. \]
  • Update the policy network (actor) by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]
  • Update the value network (critic) by:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Written on 2026-4-15