______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+


Dueling Network

A dueling network is an improvement to the neural network architecture.

Advantage Function

Definition of the optimal advantage function:

\[ A^\star(s, a) = Q^\star(s, a) - V^\star(s) \]

Here $Q^\star(s, a)$ is the optimal action-value function and $V^\star(s)$ is the optimal state-value function.

$A^\star$ measures the advantage of action $a$ relative to the baseline $V^\star(s)$. The better action $a$ is, the larger its advantage.

By Theorem (1), $V^\star(s) = \max_a Q^\star(s, a)$, we get:

\[ \max_a A^\star(s, a) = \max_a Q^\star(s, a) - V^\star(s) = 0 \]

That is, the maximum of $A^\star(s, a)$ over actions is 0.
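The result above can be checked on a small example. This is a minimal sketch in plain Python; the action names and the values in `q_star` are made-up numbers for illustration, not taken from any real environment.

```python
# Toy optimal action values Q*(s, a) for a single state s.
# The actions and numbers are assumptions chosen for illustration.
q_star = {"left": 1.0, "right": 3.5, "fire": 2.0}

# Theorem (1): V*(s) = max_a Q*(s, a).
v_star = max(q_star.values())

# Definition: A*(s, a) = Q*(s, a) - V*(s).
a_star = {action: q - v_star for action, q in q_star.items()}

print(a_star)                # {'left': -2.5, 'right': 0.0, 'fire': -1.5}
print(max(a_star.values()))  # 0.0
```

The best action has advantage 0 and every other action has a negative advantage, which matches $\max_a A^\star(s, a) = 0$.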

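Because $\max_a A^\star(s, a) = 0$, subtracting it from the right-hand side changes nothing. This gives Theorem (2):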

\[ \begin{matrix} & A^\star(s, a) = Q^\star(s, a) - V^\star(s) \\ ⇒ & Q^\star(s, a) = V^\star(s) + A^\star(s, a) \\ ⇒ & Q^\star(s, a) = V^\star(s) + A^\star(s, a) - \underset{a}{\max} A^\star(s, a) & (2) \end{matrix} \]

Building the Dueling Network

Two neural networks are needed:

Approximate $A^\star(s, a)$ by a neural network: $A(s, a; 𝐰^A)$.
Approximate $V^\star(s)$ by a neural network: $V(s; 𝐰^V)$.

Then replace the functions in Theorem (2) with these neural networks, writing $𝐰 = (𝐰^A, 𝐰^V)$:

\[ Q(s, a; 𝐰) = V(s; 𝐰^V) + A(s, a; 𝐰^A) - \max_a A(s, a; 𝐰^A) \]

This is the dueling network. It plays the same role as a DQN: it approximates $Q^\star(s, a)$.

\[ \mathrm{state}~s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} → \left\{\begin{matrix} \overset{\mathrm{Dense1}}{→} & { \left\{\begin{matrix} A(s, \mathrm{action1}; 𝐰^A) \\ A(s, \mathrm{action2}; 𝐰^A) \\ A(s, \mathrm{action3}; 𝐰^A) \\ … \end{matrix}\right. } \\ \overset{\mathrm{Dense2}}{→} & V(s; 𝐰^V) \end{matrix}\right\} \overset{\mathrm{ \begin{matrix} V(s; 𝐰^V) + A(s, \mathrm{action1}; 𝐰^A) - \underset{a}{\max} A(s, a; 𝐰^A) \\ V(s; 𝐰^V) + A(s, \mathrm{action2}; 𝐰^A) - \underset{a}{\max} A(s, a; 𝐰^A) \\ V(s; 𝐰^V) + A(s, \mathrm{action3}; 𝐰^A) - \underset{a}{\max} A(s, a; 𝐰^A) \\ … \end{matrix} }}{⟶} \left\{\begin{matrix} Q(s, \mathrm{action1}; 𝐰) \\ Q(s, \mathrm{action2}; 𝐰) \\ Q(s, \mathrm{action3}; 𝐰) \\ … \end{matrix} \right. \]

Here $\underset{a}{\max} A(s, a; 𝐰^A)$ is the largest element of the vector of advantages $A(s, \cdot\,; 𝐰^A)$.
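The two heads and the aggregation step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the convolutional part is skipped and a hypothetical feature vector is used directly, each head is a single affine layer, and all weights are made-up numbers. The helper names `dense` and `dueling_q` are not from the original.

```python
def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def dueling_q(feature, w_a, b_a, w_v, b_v):
    """Q(s, a; w) = V(s; w^V) + A(s, a; w^A) - max_a A(s, a; w^A)."""
    advantages = dense(feature, w_a, b_a)   # A(s, .; w^A), one entry per action
    value = dense(feature, w_v, b_v)[0]     # V(s; w^V), a scalar
    max_adv = max(advantages)
    return [value + a - max_adv for a in advantages]

# Hypothetical feature vector and weights for a state with three actions.
feature = [1.0, 2.0]
w_a = [[0.5, 0.0], [0.0, -0.5], [1.0, 0.25]]
b_a = [0.0, 0.0, 0.0]
w_v = [[1.0, 0.5]]
b_v = [0.0]

q = dueling_q(feature, w_a, b_a, w_v, b_v)
print(q)  # [1.0, -0.5, 2.0]
```

With max-subtraction the largest Q value equals the output of the value head (2.0 here), mirroring $V^\star(s) = \max_a Q^\star(s, a)$.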

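The dueling network is now fully built. The next step is to train its parameters $𝐰$. Training can be done exactly as for a DQN, and the same techniques apply: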

Prioritized experience replay
Double DQN
Multi-step TD target

Overcome Non-identifiability

Equation 1: $Q^\star(s, a) = V^\star(s) + A^\star(s, a)$.
Equation 2: $Q^\star(s, a) = V^\star(s) + A^\star(s, a) - \max_a A^\star(s, a)$.

Equation 1 has the problem of non-identifiability:

Let $V' = V^\star + 10$ and $A' = A^\star - 10$.
Then:

\[ Q^\star(s, a) = V^\star(s) + A^\star(s, a) = V'(s) + A'(s, a) \]

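In other words, under Equation 1 the decomposition of $Q^\star$ into $V^\star$ and $A^\star$ is not unique.

Why non-uniqueness is harmful: under Equation 1, if the networks $V$ and $A$ fluctuate during training by the same amount but in opposite directions, the output of the dueling network does not change at all. Yet both networks are fluctuating, so neither is stable and neither trains well. Subtracting the term $\underset{a}{\max} A^\star(s, a)$ helps keep the networks stable: under Equation 2, shifting $V$ up by 10 and $A$ down by 10 changes the output by 10, so the two outputs can no longer drift freely.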
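The shift argument can be checked numerically. The sketch below is plain Python with made-up head outputs; `q_plain` and `q_max` are hypothetical helper names for Equation 1 and Equation 2.

```python
def q_plain(v, adv):
    """Equation 1: Q = V + A."""
    return [v + a for a in adv]

def q_max(v, adv):
    """Equation 2: Q = V + A - max_a A."""
    max_adv = max(adv)
    return [v + a - max_adv for a in adv]

# Hypothetical head outputs for one state with three actions.
v, adv = 2.0, [0.5, -1.0, 1.5]

# Shift V up by 10 and every advantage down by 10.
v_shift = v + 10.0
adv_shift = [a - 10.0 for a in adv]

# Equation 1 cannot tell the two decompositions apart.
print(q_plain(v, adv) == q_plain(v_shift, adv_shift))  # True

# Equation 2 can: the shifted heads give an output that is 10 larger.
print(q_max(v, adv))              # [1.0, -0.5, 2.0]
print(q_max(v_shift, adv_shift))  # [11.0, 9.5, 12.0]
```

Under Equation 2 the shift shows up in the output, so a training loss on $Q$ would penalize it.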

In practice, replacing $\underset{a}{\max} A(s, a; 𝐰^A)$ with the mean $\underset{a}{\mathrm{mean}} A(s, a; 𝐰^A)$, that is:

\[ Q(s, a; 𝐰) = V(s; 𝐰^V) + A(s, a; 𝐰^A) - \underset{a}{\mathrm{mean}} A(s, a; 𝐰^A) \]

works better.
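A minimal sketch of the mean-subtraction variant, in plain Python with made-up head outputs; `dueling_q_mean` is a hypothetical helper name.

```python
def dueling_q_mean(v, adv):
    """Q(s, a; w) = V(s; w^V) + A(s, a; w^A) - mean_a A(s, a; w^A)."""
    mean_adv = sum(adv) / len(adv)
    return [v + a - mean_adv for a in adv]

# Hypothetical head outputs for one state with three actions.
v, adv = 2.0, [0.5, -1.0, 2.0]

q = dueling_q_mean(v, adv)
print(q)                # [2.0, 0.5, 3.5]
print(sum(q) / len(q))  # 2.0
```

One consequence of this form is that the value head now equals the mean of the Q values rather than their maximum.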

Written on 2026-4-8
