Advantage Actor-Critic
Two neural networks:
Policy network (actor): \(\pi(a \mid s; \boldsymbol{\theta})\)
It is an approximation to the policy function, \(\pi(a \mid s)\).
It controls the agent.
Value network (critic): \(v(s; \mathbf{w})\)
It is an approximation to the state-value function, \(V_\pi(s)\).
It evaluates how good the state \(s\) is.
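The overall architecture is the same as that of REINFORCE with baseline:
In A2C, the value network \(v(s; \mathbf{w})\) serves as a critic, whereas in REINFORCE with baseline it serves only as a baseline and does not evaluate how good an action is.
The two methods differ in how they are trained. The A2C training procedure is as follows: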
Observe a transition \((s_t, a_t, r_t, s_{t+1})\).
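TD target: \(y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w})\), where \(\gamma\) is the discount factor.
TD error: \(\delta_t = v(s_t; \mathbf{w}) - y_t\).
Update the policy network (actor) by: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\), where \(\beta\) is the actor's learning rate.
Update the value network (critic) by: \(\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}\), where \(\alpha\) is the critic's learning rate.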
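The one-step procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes the standard forms \(y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w})\) and \(\delta_t = v(s_t; \mathbf{w}) - y_t\), a tabular critic `w[s]`, and a tabular softmax actor with logits `theta[s][a]` in place of neural networks. The names `a2c_step`, `softmax`, `alpha`, `beta`, and `gamma` are hypothetical and chosen only for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_step(theta, w, s, a, r, s_next, gamma=0.9, alpha=0.1, beta=0.1):
    """One A2C update from a single transition (s, a, r, s_next).

    theta[s][a] are policy logits (actor), w[s] are state values (critic).
    Both are updated in place. Returns the TD target and TD error.
    """
    # TD target: y = r + gamma * v(s_next; w)
    y = r + gamma * w[s_next]
    # TD error: delta = v(s; w) - y
    delta = w[s] - y
    # Actor: theta <- theta - beta * delta * d ln pi(a|s) / d theta.
    # For a softmax policy, d ln pi(a|s) / d theta[s][b] = 1{b == a} - pi(b|s).
    probs = softmax(theta[s])
    for b, p in enumerate(probs):
        grad_log_pi = (1.0 if b == a else 0.0) - p
        theta[s][b] -= beta * delta * grad_log_pi
    # Critic: w <- w - alpha * delta * d v(s; w) / d w (the gradient is 1 for a table).
    w[s] -= alpha * delta
    return y, delta

# Usage: two states, two actions; the agent takes action 1 in state 0.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 1.0]
y, delta = a2c_step(theta, w, s=0, a=1, r=1.0, s_next=1)
```

Because the TD target here exceeds the critic's current estimate, \(\delta_t\) is negative, so the update raises the logit of the action taken and moves \(v(s_t; \mathbf{w})\) toward \(y_t\).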
A2C with Multi-Step TD Target
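The A2C training procedure above uses a one-step TD target, so called because the TD target contains only one reward, \(r_t\). It can be replaced with a multi-step TD target, which often improves performance. One-step TD target vs. multi-step TD target: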
Observing a transition \((s_t, a_t, r_t, s_{t+1})\).
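One-step TD target: \(y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w})\).
Observing \(m\) transitions: \(\{(s_{t+i}, a_{t+i}, r_{t+i}, s_{t+i+1})\}_{i=0}^{m-1}\).
\(m\)-step TD target: \(y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w})\).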
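As a small sketch of the difference, the function below computes an \(m\)-step TD target from a list of observed rewards and a bootstrap value. It assumes the standard form \(y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m v(s_{t+m}; \mathbf{w})\). The name `m_step_td_target` and its parameters are hypothetical, and `v_bootstrap` stands in for the critic's output \(v(s_{t+m}; \mathbf{w})\).

```python
def m_step_td_target(rewards, v_bootstrap, gamma):
    """Return sum_{i=0}^{m-1} gamma^i * rewards[i] + gamma^m * v_bootstrap.

    rewards     -- [r_t, r_{t+1}, ..., r_{t+m-1}], so m = len(rewards)
    v_bootstrap -- the critic's estimate v(s_{t+m}; w)
    gamma       -- discount factor
    """
    m = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** m) * v_bootstrap

# With a single reward (m = 1) this reduces to the one-step TD target.
one_step = m_step_td_target([1.0], v_bootstrap=2.0, gamma=0.5)
three_step = m_step_td_target([1.0, 2.0, 3.0], v_bootstrap=10.0, gamma=0.5)
```

A larger \(m\) puts more weight on observed rewards and less on the critic's own estimate, since the bootstrap term is scaled by \(\gamma^m\).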
The algorithm with the multi-step TD target proceeds as follows:
Observing a trajectory from time \(t\) to \(t + m - 1\).
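TD target: \(y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w})\).
TD error: \(\delta_t = v(s_t; \mathbf{w}) - y_t\).
Update the policy network (actor) by: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\).
Update the value network (critic) by: \(\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}\).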
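The multi-step procedure can be sketched the same way. This is an illustrative example under stated assumptions, not a definitive implementation: the critic is a table `w[s]`, the actor is a tabular softmax with logits `theta[s][a]`, and the function name `multi_step_a2c_update` and all parameter names are hypothetical. Only the TD target changes relative to the one-step case; the updates still apply to \((s_t, a_t)\).

```python
import math

def multi_step_a2c_update(theta, w, states, actions, rewards, s_end,
                          gamma=0.9, alpha=0.1, beta=0.1):
    """Update actor and critic at time t from m observed transitions.

    states[i], actions[i], rewards[i] hold s_{t+i}, a_{t+i}, r_{t+i}
    for i = 0..m-1, and s_end is s_{t+m}. theta and w are updated in place.
    """
    m = len(rewards)
    # m-step TD target: discounted observed rewards plus bootstrapped value.
    y = sum((gamma ** i) * rewards[i] for i in range(m)) + (gamma ** m) * w[s_end]
    s, a = states[0], actions[0]
    # TD error: delta = v(s_t; w) - y_t
    delta = w[s] - y
    # Softmax probabilities pi(. | s_t; theta), computed before theta changes.
    top = max(theta[s])
    exps = [math.exp(x - top) for x in theta[s]]
    total = sum(exps)
    # Actor: theta <- theta - beta * delta * d ln pi(a_t | s_t) / d theta
    for b, e in enumerate(exps):
        grad_log_pi = (1.0 if b == a else 0.0) - e / total
        theta[s][b] -= beta * delta * grad_log_pi
    # Critic: w <- w - alpha * delta * d v(s_t; w) / d w (gradient is 1 for a table).
    w[s] -= alpha * delta
    return y, delta

# Usage: a 2-step trajectory 0 -> 1 -> 2 with a reward of 1.0 at each step.
theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0, 2.0]
y, delta = multi_step_a2c_update(theta, w, states=[0, 1], actions=[1, 0],
                                 rewards=[1.0, 1.0], s_end=2, gamma=0.5)
```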