Deterministic Policy Gradient

Deterministic policy gradient (DPG) can be used to solve continuous control problems:

[Figure: the state \(s\) feeds both networks. The policy network (parameter \(𝛉\)) maps \(s\) to the action \(a = \pi(s; 𝛉)\); the value network (parameter \(\mathbf{w}\)) takes \(s\) and \(a\) and outputs the value \(q(s, a; \mathbf{w})\).]

  • Use a deterministic policy network (actor): \(a = \pi(s; 𝛉)\).

  • Use a value network (critic): \(q(s, a; \mathbf{w})\).

  • The critic outputs a scalar that evaluates how good the action \(a\) is.

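Here the output of the policy network is not a probability distribution but one specific action \(a\). Given a state \(s\), the output action \(a\) is deterministic, with no randomness. The output can be a real number or a vector whose dimension equals the number of degrees of freedom of the action. For example, a robot arm with two joints, each movable within some range, has 2 degrees of freedom, so the output vector is 2-dimensional. Being two-dimensional does not mean that the action space contains only two actions; in fact it contains infinitely many.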
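To make the two-joint example concrete, here is a minimal sketch of a deterministic policy. The linear-plus-tanh form, the function name `policy`, the weights in `theta`, and the `max_angle` bound are all illustrative assumptions, not something the notes prescribe.

```python
import math

def policy(state, theta, max_angle=1.0):
    """Hypothetical deterministic policy a = pi(s; theta): linear layer + tanh."""
    action = []
    for row in theta:  # one row of weights per action dimension (per joint)
        z = sum(w * x for w, x in zip(row, state))
        # tanh keeps each joint command inside (-max_angle, max_angle)
        action.append(max_angle * math.tanh(z))
    return action

# Assumed toy numbers: 3-dimensional state, 2 joints, so theta is 2 x 3.
theta = [[0.5, -0.2, 0.1],
         [0.3, 0.8, -0.4]]
state = [1.0, 0.5, -1.0]

a_first = policy(state, theta)
a_second = policy(state, theta)  # same state, same parameters
```

Calling the policy twice on the same state returns the same 2-dimensional vector, while each component can still take any value in a continuous interval.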

Updating Value Network by TD

  • Transition: \((s_t, a_t, r_t, s_{t+1})\).

  • Value network makes prediction for time \(t\):

\[ q_t = q(s_t, a_t; \mathbf{w}). \]
  • Value network makes prediction for time \(t+1\):

\[ q_{t+1} = q(s_{t+1}, a'_{t+1}; \mathbf{w}), \quad \text{where } a'_{t+1} = \pi(s_{t+1}; 𝛉). \]
  • TD error:

\[ \delta_t = q_t - \left( r_t + \gamma \cdot q_{t+1} \right). \]
  • Update:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Note that \(a'_{t+1}\) is not an action the agent actually executes; it is computed only in order to update the value network.
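The TD update above can be sketched for a tiny critic whose gradient is easy to write by hand. This is only an illustration under stated assumptions: scalar state and action, a linear critic \(q(s, a; \mathbf{w}) = w_0 s + w_1 a + w_2\), a linear policy \(a = \theta s\), and made-up values for \(\gamma\) and \(\alpha\).

```python
def pi(s, theta):
    """Assumed toy policy: a = theta * s."""
    return theta * s

def q(s, a, w):
    """Assumed toy critic: q(s, a; w) = w0*s + w1*a + w2."""
    return w[0] * s + w[1] * a + w[2]

def grad_q_w(s, a):
    """Gradient of the linear critic with respect to w."""
    return [s, a, 1.0]

def td_update(w, theta, transition, gamma=0.9, alpha=0.1):
    s, a, r, s_next = transition
    q_t = q(s, a, w)                    # prediction for time t
    a_next = pi(s_next, theta)          # a'_{t+1}: computed, never executed
    q_next = q(s_next, a_next, w)       # prediction for time t+1
    delta = q_t - (r + gamma * q_next)  # TD error
    grad = grad_q_w(s, a)
    new_w = [wi - alpha * delta * gi for wi, gi in zip(w, grad)]
    return new_w, delta

new_w, delta = td_update([0.0, 0.0, 0.0], 0.5, (1.0, 0.5, 1.0, 2.0))
```

With all weights at zero both predictions are 0, so the TD error is \(-1\) and each weight moves by \(\alpha\) times its gradient component.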

Updating Policy Network by DPG

  • The critic \(q(s, a; \mathbf{w})\) evaluates how good the action \(a\) is.

  • Improve \(𝛉\) so that the critic believes \(a = \pi(s; 𝛉)\) is better.

  • Update \(𝛉\) so that \(q(s, a; \mathbf{w}) = q(s, \pi(s; 𝛉); \mathbf{w})\) increases.

Training procedure:

  • Goal: increase \(q(s, a; \mathbf{w})\), where \(a = \pi(s; 𝛉)\).

  • DPG:

\[ \mathbf{g} = \frac{\partial \, q(s, \pi(s; 𝛉); \mathbf{w})}{\partial 𝛉} = \frac{\partial a}{\partial 𝛉} \cdot \frac{\partial q(s, a; \mathbf{w})}{\partial a}. \]
  • Gradient ascent:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \mathbf{g}. \]

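The DPG step simply lets the gradient propagate from the value \(q\) back to the action \(a\), and then from \(a\) into the policy network. The resulting gradient \(\mathbf{g}\) is the deterministic policy gradient, and it is used to update the policy network.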
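The chain rule in the DPG step can be checked numerically on a toy problem. The sketch below assumes a scalar policy \(a = \theta s\) and a fixed critic that is quadratic in the action, \(q(s, a; \mathbf{w}) = w_0 s + w_1 a + w_2 a^2\); both forms and all numbers are hypothetical and chosen only so that the derivatives can be written by hand.

```python
def pi(s, theta):
    """Assumed toy policy: a = theta * s."""
    return theta * s

def q(s, a, w):
    """Assumed toy critic, quadratic in the action."""
    return w[0] * s + w[1] * a + w[2] * a * a

def dq_da(s, a, w):
    """Partial derivative of the critic with respect to the action."""
    return w[1] + 2.0 * w[2] * a

def dpg(s, theta, w):
    """g = (da/dtheta) * (dq/da), evaluated at a = pi(s; theta)."""
    a = pi(s, theta)
    da_dtheta = s  # derivative of theta * s with respect to theta
    return da_dtheta * dq_da(s, a, w)

w = [0.0, 2.0, -1.0]
s, theta, beta = 1.0, 0.0, 0.1

g = dpg(s, theta, w)
theta_new = theta + beta * g  # gradient ascent on theta

# Central finite difference of q(s, pi(s; theta); w) as a sanity check.
eps = 1e-5
g_numeric = (q(s, pi(s, theta + eps), w) - q(s, pi(s, theta - eps), w)) / (2 * eps)
q_before = q(s, pi(s, theta), w)
q_after = q(s, pi(s, theta_new), w)
```

In this toy setting the analytic gradient matches the finite-difference estimate, and one ascent step raises the critic's score of the policy's action. The critic parameters stay fixed throughout; only \(\theta\) moves.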

Improvement: Using Target Networks

Training the value network with the method above alone does not work very well. Some tricks can improve it, for example target networks:

  • Value network makes a prediction for time \(t\):

\[ q_t = q(s_t, a_t; \mathbf{w}). \]
  • Target networks make a prediction for time \(t+1\):

\[ q_{t+1} = q(s_{t+1}, a'_{t+1}; \mathbf{w}^-), \quad \text{where } a'_{t+1} = \pi(s_{t+1}; 𝛉^-). \]

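Previously the value at time \(t + 1\) was computed with the value network and the policy network; now two different neural networks are used instead. \(\pi(s_{t+1}; 𝛉^-)\) is the target policy network, which stands in for the policy network: it has the same architecture as the policy network but different parameters, here \(𝛉^-\). \(q(s_{t+1}, a'_{t+1}; \mathbf{w}^-)\) is the target value network, which likewise has the same architecture as the value network but different parameters, \(\mathbf{w}^-\).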

The training procedure for the two networks is summarized as follows:

  • Policy network makes a decision: \(a = \pi(s; 𝛉)\).

  • Update policy network by DPG:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial a}{\partial 𝛉} \cdot \frac{\partial q(s, a; \mathbf{w})}{\partial a}. \]
  • Value network computes:

\[ q_t = q(s, a; \mathbf{w}). \]
  • Target networks, \(\pi(s; 𝛉^-)\) and \(q(s, a; \mathbf{w}^-)\), compute \(q_{t+1}\).

  • TD error:

\[ \delta_t = q_t - \left( r_t + \gamma \cdot q_{t+1} \right). \]
  • Update value network by TD:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s, a; \mathbf{w})}{\partial \mathbf{w}}. \]
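The summary above can be sketched as a single training step. This is a rough illustration, not a reference implementation: it assumes scalar states and actions, a toy policy \(a = \theta s\), a toy critic \(q(s, a; \mathbf{w}) = w_0 s + w_1 a + w_2 a^2\), and arbitrary step sizes. The names `train_step`, `theta_tgt`, and `w_tgt` are hypothetical.

```python
def pi(s, theta):
    return theta * s                        # assumed toy policy

def q(s, a, w):
    return w[0] * s + w[1] * a + w[2] * a * a  # assumed toy critic

def dq_da(s, a, w):
    return w[1] + 2.0 * w[2] * a

def dq_dw(s, a):
    return [s, a, a * a]

def train_step(theta, w, theta_tgt, w_tgt, transition,
               gamma=0.9, alpha=0.1, beta=0.1):
    s, a, r, s_next = transition
    # 1. Policy update by DPG, using the action the policy would pick at s.
    a_pi = pi(s, theta)
    new_theta = theta + beta * s * dq_da(s, a_pi, w)
    # 2. Value network prediction for the action stored in the transition.
    q_t = q(s, a, w)
    # 3. Target networks compute q_{t+1}.
    a_next = pi(s_next, theta_tgt)
    q_next = q(s_next, a_next, w_tgt)
    # 4. TD error and value network update.
    delta = q_t - (r + gamma * q_next)
    new_w = [wi - alpha * delta * gi for wi, gi in zip(w, dq_dw(s, a))]
    return new_theta, new_w, delta

theta, w = 0.0, [0.0, 2.0, -1.0]
theta_tgt, w_tgt = 0.5, [0.0, 0.0, 0.0]
new_theta, new_w, delta = train_step(theta, w, theta_tgt, w_tgt,
                                     (1.0, 0.5, 1.0, 2.0))
```

Note that the step reads the target parameters but never writes them; they change only through the separate weighted-averaging update.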

Updating Target Networks

  • Set a hyper-parameter \(\tau \in (0, 1)\).

  • Update the target networks by weighted averaging:

\[ \begin{aligned} \mathbf{w}^- \leftarrow \tau \cdot \mathbf{w} + (1 - \tau) \cdot \mathbf{w}^-, \\ 𝛉^- \leftarrow \tau \cdot 𝛉 + (1 - \tau) \cdot 𝛉^-. \end{aligned} \]
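The weighted averaging can be written as a small helper. The function name `soft_update`, the list-of-floats parameter layout, and the value \(\tau = 0.1\) are assumptions made for this sketch.

```python
def soft_update(target, online, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * p + (1.0 - tau) * p_tgt for p, p_tgt in zip(online, target)]

w = [1.0, 2.0]        # assumed current value-network parameters
w_tgt = [0.0, 0.0]    # assumed current target parameters
tau = 0.1

w_tgt_once = soft_update(w_tgt, w, tau)

# If w stayed fixed, repeated averaging would pull the target toward it.
w_tgt_many = w_tgt
for _ in range(200):
    w_tgt_many = soft_update(w_tgt_many, w, tau)
```

A small \(\tau\) makes the target networks trail the trained networks slowly, while \(\tau\) close to 1 makes them follow almost immediately.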

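The TD target computed with the target networks still depends on the policy network and the value network, because \(𝛉^-\) and \(\mathbf{w}^-\) are weighted averages that include \(𝛉\) and \(\mathbf{w}\). Target networks therefore cannot completely avoid bootstrapping, and bias may still arise.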

Besides target networks, the techniques described earlier for improving DQN training, experience replay and multi-step TD targets, can also be used to improve DPG training.

Stochastic Policy vs. Deterministic Policy

| | Stochastic Policy | Deterministic Policy |
|---|---|---|
| Policy | \(\pi(a \mid s; 𝛉)\) | \(\pi(s; 𝛉)\) |
| Output | Probability distribution over actions | Action \(a\) |
| Control | Randomly sample an action | Directly use output \(a\) |
| Application | Mostly discrete control | Continuous control |

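The "Control" row of the comparison can be illustrated in a few lines. Both functions are hypothetical stand-ins: the stochastic policy is represented directly by a fixed probability vector over three discrete actions, and the deterministic policy by the toy map \(a = \theta s\).

```python
import random

def stochastic_control(probs, rng):
    """Sample a discrete action index from the distribution pi(. | s; theta)."""
    actions = list(range(len(probs)))
    return rng.choices(actions, weights=probs, k=1)[0]

def deterministic_control(s, theta):
    """Use the policy output a = pi(s; theta) directly (assumed toy form)."""
    return theta * s

rng = random.Random(0)
probs = [0.2, 0.5, 0.3]  # assumed output of a stochastic policy for one state
sampled = [stochastic_control(probs, rng) for _ in range(200)]

a_first = deterministic_control(2.0, 0.5)
a_second = deterministic_control(2.0, 0.5)
```

Repeated calls on the same state yield varying action indices in the stochastic case and the same real-valued action in the deterministic case.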
Written on 2026-4-16