Experience Replay

经验回放,改进训练算法。

改进 TD Learning

原先的 TD Learing 流程:

  • Observe state \(s_t\) and perform action \(a_t\).

  • Environment provides new state \(s_{t+1}\) and reward \(r_t\).

  • TD target: \(y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; 𝐰).\)

  • TD error: \(\delta_t = q_t - y_t\), where \(q_t = Q(s_t, a_t; 𝐰).\)

  • Goal: Make \(q_t\) close to \(y_t\), for all \(t\). (Equivalently, make \(\delta_t^2\) small.)

  • TD learning: Find \(w\) by minimizing \(L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}\)

  • Online gradient descent:

  • Observe \((s_t, a_t, r_t, s_{t+1})\) and compute \(\delta_t\).

  • Compute gradient:\(\mathbf{g}_t = \frac{\partial (\delta_t^2 / 2)}{\partial w} = \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}.\)

  • Gradient descent: \(w \leftarrow w - \alpha \cdot \mathbf{g}_t.\)

  • Discard \((s_t, a_t, r_t, s_{t+1})\) after using it.

但实际上,以上 TD 算法的收敛率不高。其第一个缺点是浪费 experience 。

Experience 指的是从开始到结束的所有 transition (\(s_t, a_t, r_t, s_{t + 1}\))

先前每经过一轮学习都要丢弃 transition (\(s_t, a_t, r_t, s_{t + 1}\)),但实际上这些 experience 是可以重复利用的。

第二个缺点就是 transition 之间相关性强。Transition 之间的相关性是有害的,如果能把序列打散,消除之间的关联,则有利于把 DGN 训练的更好。

Experience Replay 可以克服以上两个缺点:即可以重复利用经验以避免浪费,也可以把序列打散以消除相关性,有助于跳出局部最小值:

  • A transition: \((s_t, a_t, r_t, s_{t+1})\).

  • Store recent \(n\) transitions in a replay buffer.

  • Remove old transitions so that the buffer has at most \(n\) transitions.

  • Buffer capacity \(n\) is a tuning hyper-parameter [1, 2].

  • \(n\) is typically large, e.g., \(10^5 \sim 10^6\).

  • The setting of \(n\) is application-specific.

TD with Experience Replay

  • Find \(w\) by minimizing

\[ L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}. \]
  • Stochastic gradient descent (SGD):

  • Randomly sample a transition \((s_i, a_i, r_i, s_{i+1})\) from the buffer.

  • Compute TD error \(\delta_i\).

  • Stochastic gradient:

\[ \mathbf{g}_i = \frac{\partial (\frac{\delta_t^2}{2})}{\partial w} = \delta_i \cdot \frac{\partial Q(s_i, a_i; w)}{\partial w}. \]
  • SGD:

\[ w \leftarrow w - \alpha \cdot \mathbf{g}_i. \]

以上只用了一个 transition ,实践中通常使用 Mini-batch SGD(小批量随机梯度下降)取出多个 transition,算出多个梯度,取平均梯度来更新 \(𝐰\) 做 experience replay 。相较于单样本训练(SGD),梯度估计更稳定,有助于模型收敛。

Prioritized Experience Replay

和上文那个 experience replay 不同,prioritized experience replay 用非均匀抽样代替均匀抽样。其基本思想如下:

  • Not all transitions are equally important;

  • If a transition has high TD error \(\left | δ_t \right |\), it will be given high priority.

  • Use importance sampling instead of uniform sampling:

  • Option 1: Sampling probability \(p_t ∝ |δ_t| + ϵ\).

  • Option 2: Sampling probability \(p_t ∝ \frac{1}{\mathrm{rank}(t)}\).

  • The transitions are sorted so that \(|\delta_t|\) is in descending order.

  • \(\mathrm{rank}(t)\) is the rank of the \(t\)-th transition.

  • In sum, big \(|\delta_t|\) shall be given high priority.

不同的 transition 有不同的抽样概率,这样会导致 DQN 的预测有偏差,应当相应调整学习率,抵消掉不同抽样造成的偏差:

  • SGD: \(𝐰 ← 𝐰 - α ⋅ 𝐠\), where \(α\) is the learning rate.

  • If uniform sampling is used, \(α\) is the same for all transitions.

  • If importance sampling is used, \(α\) shall be adjusted according to the importance.

  • Scale the learning rate by \((n p_t)^{-\beta}\), where \(\beta \in (0,1)\).

  • If \(p_1 = \cdots = p_n = \frac{1}{n}\) (uniform sampling), the scaling factor is equal to 1.

  • High-importance transitions (with high \(p_t\)) have low learning rates.

  • In the beginning, set \(\beta\) small; increase \(\beta\) to 1 over time.

Update TD Error:

  • Associate each transition \((s_t, a_t, r_t, s_{t+1})\) with a TD error \(\delta_t\).

  • If a transition is newly collected, we do not know its \(\delta_t\).

  • Simply set its \(\delta_t\) to the maximum.

  • It has the highest priority.

  • Each time \((s_t, a_t, r_t, s_{t+1})\) is selected from the buffer, we update its \(\delta_t\).

总结一下就是:replay buffer 里面存了 \(n\) 条 transition ,每条 transition 都是一个四元组,除此之外每条 transition 还有 TD error \(δ_t\) 。优先经验回放要根据 transition 的重要性抽样,抽样概率 \(p\) 正比于 \(\left | δ_t \right |\) 。TD error \(δ_t\) 越大,transition 被抽到的概率越大。原本使用均匀抽样的时候,学习率固定为 \(α\) ;现在使用了不同的抽样概率,会造成 DQN 的预测有偏差。为了消除偏差,应当相应调整学习率,把学习率 \(α\) 按照抽样概率 \(p\) 来做调整,学习率变成了 \(α ⋅ (n p_t)^{-β}\) 。抽样概率 \(p_t\) 越大,学习率越小。

Transitions Sampling Probabilities Learning Rates
\(⋯\) \(⋯\) \(⋯\)
\((s_t, a_t, r_t, s_{t+1}),δ_t\) \(p_t ∝ \lvert δ_t \rvert + ϵ\) \(α ⋅ (n p_t)^{-β}\)
\((s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}),δ_{t+1}\) \(p_{t+1} ∝ \lvert δ_{t+1} \rvert + ϵ\) \(α ⋅ (n p_{t+1})^{-β}\)
\((s_{t+2}, a_{t+2}, r_{t+2}, s_{t+3}),δ_{t+2}\) \(p_{t+2} ∝ \lvert δ_{t+2} \rvert + ϵ\) \(α ⋅ (n p_{t+2})^{-β}\)
\(⋯\) \(⋯\) \(⋯\)

Big \(\lvert δ_t \rvert\) \(\Longrightarrow\) High probabiliy \(\Longrightarrow\) Small learning rate

作于 2026-4-7