______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

REINFORCE with Baseline

Approximations to Policy Gradient

Start from the policy gradient with baseline:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} = \mathbb{E}_{A_t \sim \pi} \left[ \frac{\partial \ln \pi(A_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( Q_\pi(s_t, A_t) - V_\pi(s_t) \right) \right]. \]

Neither \(Q_\pi(s_t, A_t)\) nor \(V_\pi(s_t)\) is known, so both are approximated:

Recall that \(Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t]\).
Monte Carlo approximation \(Q_\pi(s_t, a_t) \approx u_t\) (REINFORCE):
Observe the trajectory: \(s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, \cdots, s_n, a_n, r_n\).
Compute the return:

\[ u_t = \sum_{i=t}^{n} \gamma^{\,i-t} \cdot r_i. \]

\(u_t\) is an unbiased estimate of \(Q_\pi(s_t, a_t)\).
Approximate \(V_\pi(s_t)\) by the value network, \(v(s_t; \mathbf{w})\).
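The return formula above can be computed in a single backward sweep over the rewards, using \(u_t = r_t + \gamma \cdot u_{t+1}\). The following is a minimal sketch; the function name `discounted_returns` and the sample rewards are illustrative assumptions, not part of the original notes.

```python
def discounted_returns(rewards, gamma):
    """Return [u_1, ..., u_n] with u_t = sum_{i=t}^{n} gamma^(i-t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so that u_t = r_t + gamma * u_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical rewards r_1, r_2, r_3 of one finished episode.
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward sweep costs \(O(n)\) instead of the \(O(n^2)\) of evaluating the sum separately for every \(t\).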

The resulting approximate policy gradient is:

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} \approx 𝐠(a_t) \approx \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]

Three approximations were made along the way:

Approximate expectation using one sample, \(a_t\). (Monte Carlo.)
Approximate \(Q_\pi(s_t, a_t)\) by \(u_t\). (Another Monte Carlo.)
Approximate \(V_\pi(s)\) by the value network, \(v(s; \mathbf{w})\).
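To make \(𝐠(a_t)\) concrete, here is a small sketch for the special case where the policy is a softmax over a vector of logits and the gradient is taken with respect to those logits. This parameterization is an assumption made only for illustration; in that case \(\partial \ln \pi(a) / \partial z_k = \mathbb{1}[k = a] - \pi(k)\). The names `softmax` and `approx_policy_gradient` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def approx_policy_gradient(logits, action, u_t, v_t):
    """g(a_t) with respect to the logits of a softmax policy (assumed form)."""
    probs = softmax(logits)
    # d ln pi(a) / d logit_k = 1[k == a] - pi(k)
    grad_log_pi = [(1.0 if k == action else 0.0) - p
                   for k, p in enumerate(probs)]
    # Scale by the sampled return minus the baseline, u_t - v(s_t; w).
    return [g * (u_t - v_t) for g in grad_log_pi]


g = approx_policy_gradient([0.0, 0.0], action=0, u_t=3.0, v_t=1.0)
print(g)
```

When the observed return exceeds the baseline, the component for the chosen action is positive, so gradient ascent raises its probability.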

Policy and Value Networks

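Computing \(𝐠(a_t)\) above requires two networks: a policy network and a value network. The policy network controls the agent; the value network serves as the baseline that helps train the policy network.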

\[ \text{state } s \overset{\mathrm{Conv}}{\to} \text{feature vector} \left\{ \begin{aligned} \overset{\mathrm{Dense1}}{\to} {} & \underset{n \text{ is the number of actions}}{n \times 1 \text{ vector}} \overset{\mathrm{Softmax}}{\to} \left\{ \begin{matrix} \text{action 1, } p_1 \\ \text{action 2, } p_2 \\ \cdots \\ \text{action } n \text{, } p_n \end{matrix} \right\} \pi(a \mid s; 𝛉) \\ \overset{\mathrm{Dense2}}{\to} {} & v(s; \mathbf{w}) \end{aligned} \right. \]
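The diagram shows one shared feature vector feeding two heads: Dense1 plus softmax for the policy, and Dense2 for the scalar value. Below is a rough sketch of that forward pass. It assumes the convolutional feature vector is already given and uses plain lists instead of a deep learning framework; `dense`, `softmax`, `forward`, and all weights are hypothetical names and values.

```python
import math


def dense(weights, bias, x):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(feature, policy_w, policy_b, value_w, value_b):
    """Return (pi(. | s; theta), v(s; w)) from a shared feature vector."""
    probs = softmax(dense(policy_w, policy_b, feature))  # Dense1 + Softmax
    value = dense([value_w], [value_b], feature)[0]      # Dense2, scalar
    return probs, value


feature = [1.0, 2.0]                       # stand-in for the Conv output
policy_w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # n = 3 actions
policy_b = [0.0, 0.0, 0.0]
probs, value = forward(feature, policy_w, policy_b, [0.5, 0.25], 0.0)
print(probs, value)
```

The policy head returns one probability per action, while the value head returns a single number that depends only on the state.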

REINFORCE with Baseline

Updating the policy network

\[ \frac{\partial V_\pi(s_t)}{\partial 𝛉} \approx \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]

Update policy network by policy gradient ascent:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉} \cdot \left( u_t - v(s_t; \mathbf{w}) \right). \]

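Denote \(u_t - v(s_t; \mathbf{w})\) by \(-\delta_t\), where \(\delta_t = v(s_t; \mathbf{w}) - u_t\) is the difference between the value network's prediction and the actually observed return \(u_t\).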

Written in terms of \(\delta_t\), the gradient ascent update is:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]
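As a sanity check on the sign of the update, the sketch below applies one step to a softmax policy whose parameters are the logits themselves (an assumption for illustration). With \(\delta_t < 0\), meaning the return was better than the baseline predicted, the probability of the chosen action should go up. The name `policy_step` and the numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_step(logits, action, delta, beta):
    """theta <- theta - beta * delta * d ln pi(a | s; theta) / d theta."""
    probs = softmax(logits)
    return [z - beta * delta * ((1.0 if k == action else 0.0) - p)
            for k, (z, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0]
delta = 1.0 - 3.0            # v(s_t; w) - u_t with v = 1, u_t = 3
new_logits = policy_step(logits, action=0, delta=delta, beta=0.1)
print(softmax(logits)[0], softmax(new_logits)[0])
```

Flipping the sign of `delta` makes the same step lower the probability of the chosen action instead.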

Updating the value network

Recall that \(v(s_t; \mathbf{w})\) is an approximation to \(V_\pi(s_t) = \mathbb{E}[U_t \mid s_t]\).
Prediction error: \(\delta_t = v(s_t; \mathbf{w}) - u_t\).
Gradient of the squared error \(\delta_t^2 / 2\) with respect to \(\mathbf{w}\):

\[ \frac{\partial \, \frac{\delta_t^2}{2}}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Gradient descent:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]
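The gradient descent step can be tried out on a linear value function \(v(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)\), for which \(\partial v / \partial \mathbf{w} = \mathbf{x}(s)\). The linear form, the name `value_step`, and the sample numbers are assumptions for this sketch; the notes themselves use a neural network.

```python
def value(w, x):
    """Linear value estimate v(s; w) = w . x(s) (assumed form)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def value_step(w, x, u_t, alpha):
    """w <- w - alpha * delta_t * dv/dw, with dv/dw = x for a linear v."""
    delta = value(w, x) - u_t          # prediction error delta_t
    return [wi - alpha * delta * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 2.0]                         # hypothetical features of s_t
u_t = 5.0                              # observed return
w_new = value_step(w, x, u_t, alpha=0.1)
# The new prediction moves from v = 0 toward the observed return u_t.
print(value(w, x), value(w_new, x))
```

One step shrinks \(|\delta_t|\) for this state as long as the learning rate \(\alpha\) is small enough.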

Summary of Algorithm

Play a game to the end and observe the trajectory:

\(s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n\).

Compute

\[ u_t = \sum_{i=t}^{n} \gamma^{\,i-t} \cdot r_i \quad \text{and} \quad \delta_t = v(s_t; \mathbf{w}) - u_t. \]

Update the policy network by:

\[ 𝛉 \leftarrow 𝛉 - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; 𝛉)}{\partial 𝛉}. \]

Update the value network by:

\[ \mathbf{w} \leftarrow \mathbf{w}- \alpha \cdot \delta_t \cdot \frac{\partial v(s_t; \mathbf{w})}{\partial \mathbf{w}}. \]

Repeat this procedure for every \(t\) from \(t = 1\) to \(t = n\), for a total of \(n\) rounds of updates.
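Putting the steps together, the following sketch runs the \(n\) rounds of updates over one finished episode. To stay short it assumes tabular parameters instead of neural networks: `theta[s]` holds the softmax logits for state `s`, and `w[s]` is the value estimate, so \(\partial v(s; \mathbf{w}) / \partial w_s = 1\). All names and the sample trajectory are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_with_baseline(trajectory, theta, w, gamma, beta, alpha):
    """Apply the n updates for one episode, in place.

    trajectory: list of (state, action, reward) with integer states/actions.
    theta[s][a]: logits of a tabular softmax policy (assumption).
    w[s]: tabular value estimate v(s; w) (assumption).
    """
    # Returns u_t, computed backwards via u_t = r_t + gamma * u_{t+1}.
    returns = [0.0] * len(trajectory)
    running = 0.0
    for t in reversed(range(len(trajectory))):
        running = trajectory[t][2] + gamma * running
        returns[t] = running

    for (s, a, _), u in zip(trajectory, returns):
        delta = w[s] - u                      # delta_t = v(s_t; w) - u_t
        probs = softmax(theta[s])             # pi(. | s_t) before the update
        for k in range(len(theta[s])):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[s][k] -= beta * delta * grad_log_pi   # policy update
        w[s] -= alpha * delta                 # value update, dv/dw[s] = 1
    return theta, w


episode = [(0, 1, 0.0), (1, 0, 1.0)]          # (s_t, a_t, r_t)
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0]
reinforce_with_baseline(episode, theta, w, gamma=1.0, beta=0.1, alpha=0.5)
print(theta, w)
```

Here \(\delta_t\) is recomputed from the current `w` at each \(t\); computing every \(\delta_t\) up front before any update is an equally common variant.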

Written on 2026-4-13
