______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+


Advantage Actor-Critic

A2C uses two neural networks:

Policy network (actor): \(\pi(a \mid s; 𝛉)\)
It is an approximation to the policy function, \(\pi(a \mid s)\).
It controls the agent.
Value network (critic): \(v(s; \mathbf{w})\)
It is an approximation to the state-value function, \(V_\pi(s)\).
It evaluates how good the state \(s\) is.

The overall architecture is the same as in REINFORCE with Baseline:

\[ \text{state }s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{→} & \underset{n \text{ is count of actions}}{n × 1 \text{ vector}} \overset{\text{Softmax}}{→} \left\{\begin{matrix} \text{action 1, } p_1 \\ \text{action 2, } p_2 \\ ⋯ \\ \text{action } n \text{, } p_n \end{matrix}\right\} \pi(a ∣ s; 𝛉) \\ \overset{\mathrm{Dense2}}{→} & v(s; 𝐰) \end{aligned} \right. \]

In A2C, \(v(s; 𝐰)\) acts as a critic. In REINFORCE with Baseline, the value network serves only as a baseline and does not evaluate how good an action is.
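The two-head layout above can be sketched in a few lines. This is only an illustration, not the author's implementation: the convolutional layers are skipped and the feature vector is given directly, each head is a single dense layer, and all names (`dense`, `actor_critic_forward`) and numbers are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, bias, x):
    # weights holds one row per output unit.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def actor_critic_forward(features, actor_params, critic_params):
    """features: the shared feature vector (output of Conv in the diagram)."""
    w1, b1 = actor_params    # Dense1: features -> n action logits
    w2, b2 = critic_params   # Dense2: features -> one scalar
    probs = softmax(dense(w1, b1, features))   # pi(. | s; theta)
    value = dense(w2, b2, features)[0]         # v(s; w)
    return probs, value

# Hypothetical example: 3 features, 2 actions.
features = [1.0, 0.5, -1.0]
actor_params = ([[0.2, -0.1, 0.0], [0.0, 0.3, 0.1]], [0.0, 0.0])
critic_params = ([[0.5, 0.5, 0.5]], [0.1])
probs, value = actor_critic_forward(features, actor_params, critic_params)
```

Both heads read the same feature vector, so the actor outputs one probability per action while the critic outputs a single number that depends on the state only.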

The two methods differ in how they are trained. A2C is trained as follows:

Observe a transition \((s_t, a_t, r_t, s_{t+1})\).
TD target:

\[ y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w}). \]

TD error:

\[ \delta_t = v(s_t; \mathbf{w}) - y_t. \]

Update the policy network (actor) by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]

Update the value network (critic) by:

\[ \mathbf{w} \leftarrow \mathbf{w}- \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]
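As a minimal sketch of the four steps above, assume a tabular parametrization (an assumption made here for brevity, not something the notes prescribe): `theta[s][k]` holds the softmax logit of action `k` in state `s`, and `w[s]` holds the critic's value for state `s`. Under that assumption \(\partial \ln \pi(a \mid s; 𝛉) / \partial 𝛉_{s,k} = \mathbb{1}[k = a] - \pi(k \mid s; 𝛉)\) and \(\partial v(s; \mathbf{w}) / \partial w_s = 1\). The critic step is a gradient-descent step on \(\delta_t^2 / 2\) with \(y_t\) treated as a constant. The function name `a2c_one_step` and the hyperparameter values are hypothetical, and terminal states are not handled.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_one_step(theta, w, s, a, r, s_next, gamma=0.9, alpha=0.1, beta=0.1):
    """Apply one A2C update in place for the transition (s, a, r, s_next).

    theta: list of per-state logit lists (tabular actor, an assumption).
    w:     list of per-state values (tabular critic, an assumption).
    Returns the TD error delta_t.
    """
    y = r + gamma * w[s_next]      # TD target
    delta = w[s] - y               # TD error, same sign convention as the text
    probs = softmax(theta[s])      # pi(. | s; theta) before the update

    # Actor: theta <- theta - beta * delta * d ln pi(a | s) / d theta
    for k in range(len(theta[s])):
        grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
        theta[s][k] -= beta * delta * grad_log_pi

    # Critic: w <- w - alpha * delta * d v(s) / d w, and d v(s) / d w[s] = 1
    w[s] -= alpha * delta
    return delta

# Hypothetical example: 2 states, 2 actions, everything initialised to zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0]
delta = a2c_one_step(theta, w, s=0, a=1, r=1.0, s_next=1)
```

In this example the TD target exceeds the critic's estimate, so \(\delta_t\) is negative: the logit of the action taken goes up, the other logit goes down, and \(v(s_t)\) moves towards \(y_t\).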

A2C with Multi-Step TD Target

The A2C training procedure above uses a one-step TD target, so called because the target contains only a single reward, \(r_t\). Replacing it with a multi-step TD target can improve performance. One-step versus multi-step TD target:

Observing a transition \((s_t, a_t, r_t, s_{t+1})\).
One-step TD target:

\[ y_t = r_t + \gamma \cdot v(s_{t+1}; \mathbf{w}). \]

Observing \(m\) transitions:

\[ \{(s_{t+i}, a_{t+i}, r_{t+i}, s_{t+i+1})\}_{i=0}^{m-1}. \]

\(m\)-step TD target:

\[ y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w}). \]
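The \(m\)-step target is a discounted sum of the \(m\) observed rewards plus a discounted bootstrap value. A small sketch follows; the function name `m_step_td_target` and the numbers are hypothetical, and the critic's estimate \(v(s_{t+m}; \mathbf{w})\) is passed in as a plain number.

```python
def m_step_td_target(rewards, v_bootstrap, gamma):
    """Compute y_t from rewards [r_t, ..., r_{t+m-1}] and v(s_{t+m}; w)."""
    m = len(rewards)
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma ** m * v_bootstrap

# Hypothetical numbers, chosen so the arithmetic is easy to check by hand.
one_step = m_step_td_target([1.0], v_bootstrap=2.0, gamma=0.5)
three_step = m_step_td_target([1.0, 0.0, 4.0], v_bootstrap=2.0, gamma=0.5)
```

With \(m = 1\) the function reduces to the one-step target, \(1 + 0.5 \cdot 2 = 2\). With three rewards it gives \(1 + 0.5 \cdot 0 + 0.25 \cdot 4 + 0.125 \cdot 2 = 2.25\).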

With the multi-step TD target, the training procedure becomes:

Observing a trajectory from time \(t\) to \(t + m - 1\).
TD target:

\[ y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot v(s_{t+m}; \mathbf{w}). \]

TD error:

\[ \delta_t = v(s_t; \mathbf{w}) - y_t. \]

Update the policy network (actor) by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]

Update the value network (critic) by:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Written on 2026-4-15
