______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

[<]サイバー環状線的随笔

ノート環状線

OpenClaw执行命令于Docker上

深度强化学习

初识深度强化学习 Value-based RL Policy-based RL Actor-Critic Method Alpha Go 蒙特卡洛算法 Experience Replay Dueling Network MARL Policy Gradient with Baseline Reinforce with Baseline A2C DPG Stochastic Policy for Continuous Control

通信视觉

WEB

改变 𝕏 网页端默认字体时需注入的 CSS

Experience Replay

经验回放，改进训练算法。

改进 TD Learning

原先的 TD Learing 流程：

Observe state \(s_t\) and perform action \(a_t\).
Environment provides new state \(s_{t+1}\) and reward \(r_t\).
TD target: \(y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; 𝐰).\)
TD error: \(\delta_t = q_t - y_t\), where \(q_t = Q(s_t, a_t; 𝐰).\)
Goal: Make \(q_t\) close to \(y_t\), for all \(t\). (Equivalently, make \(\delta_t^2\) small.)
TD learning: Find \(w\) by minimizing \(L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}\)
Online gradient descent:
Observe \((s_t, a_t, r_t, s_{t+1})\) and compute \(\delta_t\).
Compute gradient:\(\mathbf{g}_t = \frac{\partial (\delta_t^2 / 2)}{\partial w} = \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}.\)
Gradient descent: \(w \leftarrow w - \alpha \cdot \mathbf{g}_t.\)
Discard \((s_t, a_t, r_t, s_{t+1})\) after using it.

但实际上，以上 TD 算法的收敛率不高。其第一个缺点是浪费 experience 。

Experience 指的是从开始到结束的所有 transition (\(s_t, a_t, r_t, s_{t + 1}\))

先前每经过一轮学习都要丢弃 transition (\(s_t, a_t, r_t, s_{t + 1}\))，但实际上这些 experience 是可以重复利用的。

第二个缺点就是 transition 之间相关性强。Transition 之间的相关性是有害的，如果能把序列打散，消除之间的关联，则有利于把 DGN 训练的更好。

Experience Replay 可以克服以上两个缺点：即可以重复利用经验以避免浪费，也可以把序列打散以消除相关性，有助于跳出局部最小值：

A transition: \((s_t, a_t, r_t, s_{t+1})\).
Store recent \(n\) transitions in a replay buffer.
Remove old transitions so that the buffer has at most \(n\) transitions.
Buffer capacity \(n\) is a tuning hyper-parameter [1, 2].
\(n\) is typically large, e.g., \(10^5 \sim 10^6\).
The setting of \(n\) is application-specific.

TD with Experience Replay

Find \(w\) by minimizing

\[ L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}. \]

Stochastic gradient descent (SGD):
Randomly sample a transition \((s_i, a_i, r_i, s_{i+1})\) from the buffer.
Compute TD error \(\delta_i\).
Stochastic gradient:

\[ \mathbf{g}_i = \frac{\partial (\frac{\delta_t^2}{2})}{\partial w} = \delta_i \cdot \frac{\partial Q(s_i, a_i; w)}{\partial w}. \]

SGD:

\[ w \leftarrow w - \alpha \cdot \mathbf{g}_i. \]

以上只用了一个 transition ，实践中通常使用 Mini-batch SGD（小批量随机梯度下降）取出多个 transition，算出多个梯度，取平均梯度来更新 \(𝐰\) 做 experience replay 。相较于单样本训练（SGD），梯度估计更稳定，有助于模型收敛。

Prioritized Experience Replay

和上文那个 experience replay 不同，prioritized experience replay 用非均匀抽样代替均匀抽样。其基本思想如下：

Not all transitions are equally important;
If a transition has high TD error \(\left | δ_t \right |\), it will be given high priority.
Use importance sampling instead of uniform sampling:
Option 1: Sampling probability \(p_t ∝ |δ_t| + ϵ\).
Option 2: Sampling probability \(p_t ∝ \frac{1}{\mathrm{rank}(t)}\).
The transitions are sorted so that \(|\delta_t|\) is in descending order.
\(\mathrm{rank}(t)\) is the rank of the \(t\)-th transition.
In sum, big \(|\delta_t|\) shall be given high priority.

不同的 transition 有不同的抽样概率，这样会导致 DQN 的预测有偏差，应当相应调整学习率，抵消掉不同抽样造成的偏差：

SGD: \(𝐰 ← 𝐰 - α ⋅ 𝐠\), where \(α\) is the learning rate.
If uniform sampling is used, \(α\) is the same for all transitions.
If importance sampling is used, \(α\) shall be adjusted according to the importance.
Scale the learning rate by \((n p_t)^{-\beta}\), where \(\beta \in (0,1)\).
If \(p_1 = \cdots = p_n = \frac{1}{n}\) (uniform sampling), the scaling factor is equal to 1.
High-importance transitions (with high \(p_t\)) have low learning rates.
In the beginning, set \(\beta\) small; increase \(\beta\) to 1 over time.

Update TD Error:

Associate each transition \((s_t, a_t, r_t, s_{t+1})\) with a TD error \(\delta_t\).
If a transition is newly collected, we do not know its \(\delta_t\).
Simply set its \(\delta_t\) to the maximum.
It has the highest priority.
Each time \((s_t, a_t, r_t, s_{t+1})\) is selected from the buffer, we update its \(\delta_t\).

总结一下就是：replay buffer 里面存了 \(n\) 条 transition ，每条 transition 都是一个四元组，除此之外每条 transition 还有 TD error \(δ_t\) 。优先经验回放要根据 transition 的重要性抽样，抽样概率 \(p\) 正比于 \(\left | δ_t \right |\) 。TD error \(δ_t\) 越大，transition 被抽到的概率越大。原本使用均匀抽样的时候，学习率固定为 \(α\) ；现在使用了不同的抽样概率，会造成 DQN 的预测有偏差。为了消除偏差，应当相应调整学习率，把学习率 \(α\) 按照抽样概率 \(p\) 来做调整，学习率变成了 \(α ⋅ (n p_t)^{-β}\) 。抽样概率 \(p_t\) 越大，学习率越小。

Transitions	Sampling Probabilities	Learning Rates
\(⋯\)	\(⋯\)	\(⋯\)
\((s_t, a_t, r_t, s_{t+1}),δ_t\)	\(p_t ∝ \lvert δ_t \rvert + ϵ\)	\(α ⋅ (n p_t)^{-β}\)
\((s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}),δ_{t+1}\)	\(p_{t+1} ∝ \lvert δ_{t+1} \rvert + ϵ\)	\(α ⋅ (n p_{t+1})^{-β}\)
\((s_{t+2}, a_{t+2}, r_{t+2}, s_{t+3}),δ_{t+2}\)	\(p_{t+2} ∝ \lvert δ_{t+2} \rvert + ϵ\)	\(α ⋅ (n p_{t+2})^{-β}\)
\(⋯\)	\(⋯\)	\(⋯\)

Big \(\lvert δ_t \rvert\) \(\Longrightarrow\) High probabiliy \(\Longrightarrow\) Small learning rate

作于 2026-4-7

[<<]Monte Carlo Algorithms

Dueling Network[>>]

|三

ノート環状線

[ERROR]连接出错，请重试

______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

Experience Replay

改进 TD Learning

TD with Experience Replay

Prioritized Experience Replay

|三

ノート環状線

[ERROR]连接出错，请重试

______ ______________ __

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\__|____/_____|_/\_\

+dwb===================dwb+

Experience Replay

改进 TD Learning

TD with Experience Replay

Prioritized Experience Replay

______

|___|_|\|/___|_/\_\