______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

[<]サイバー環状線的随笔

ノート環状線

OpenClaw执行命令于Docker上

深度强化学习

初识深度强化学习 Value-based RL Policy-based RL Actor-Critic Method Alpha Go 蒙特卡洛算法 Experience Replay Dueling Network MARL Policy Gradient with Baseline Reinforce with Baseline A2C DPG Stochastic Policy for Continuous Control

通信视觉

WEB

改变 𝕏 网页端默认字体时需注入的 CSS

实例：Alpha Go 下围棋

Training in 3 steps:

Initialize policy network using behavior cloning.

(Supervised learning from human experience.)

Train the policy network using policy gradient.

(Two policy networks play against each other.)

After training the policy network, use it to train a value network.

Execution (actually play Go games):
Do Monte Carlo Tree Search (MCTS) using the policy and value networks.

Policy Network

state \(s\)

Alpha Go Zero 以 19 × 19 × 17 的 tensor 表示棋盘的状态。19 × 19 为棋盘大小，17 层的作用分别为：

层	作用
1	此时黑子的位置（1 if 黑子在此处，0 if 黑子不在此处）
2 ~ 8	_ and for the previous 7 time periods
8	此时白子的位置（1 if 白子在此处，0 if 白子不在此处）
9 ~ 16	_ and for the previous 7 time periods
17	All 1 if 要下黑子，All 0 if 要下白子

Policy Network design:

\[ \underset{\mathrm{state}}{19 × 19 × 17~\mathrm{tensor}} \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \overset{\mathrm{Dense}}{→} \underset{ \begin{matrix} \mathrm{Probability~distribution} \\ \mathrm{over~the~361~actions} \end{matrix} }{ \left\{\begin{matrix} π(1 ∣ s, 𝛉) \\ π(2 ∣ s, 𝛉) \\ π(3 ∣ s, 𝛉) \\ ⋯ \\ π(359 ∣ s, 𝛉) \\ π(360 ∣ s, 𝛉) \\ π(361 ∣ s, 𝛉) \end{matrix}\right. } \]

而 2016 年的 Alpha Go 则用 19 × 19 × 48 的 tensor 表示棋盘的状态。当时的 Policy Network 设计如下（没有全连接层）：
\[ \underset{\mathrm{state}}{19 × 19 × 48~\mathrm{tensor}} \overset{\mathrm{Conv}}{→} \underset{ \begin{matrix} \mathrm{Probability~distribution} \\ \mathrm{over~the~361~actions} \end{matrix} }{ \left\{\begin{matrix} π(1 ∣ s, 𝛉) \\ π(2 ∣ s, 𝛉) \\ π(3 ∣ s, 𝛉) \\ ⋯ \\ π(359 ∣ s, 𝛉) \\ π(360 ∣ s, 𝛉) \\ π(361 ∣ s, 𝛉) \end{matrix}\right. } \]

Behavior Cloning

最初训练的时候，神经网络的参数都是完全随机的，需要随即摸索很多次才能做出合理的动作。若一上来就用策略梯度算法来学习策略网络，需要花非常久的时间。因而 Alpha Go 利用人类的知识来学习初步的策略网络，即 Behavior Cloning 。

Behavior Cloning 是 2016 年的 Alpha Go 训练 Policy Network 的方法，其没有 Reward ，不是强化学习。Alpha Go Zero 没有经过 behavior cloning 。

Behavior Learning 是一种 Imitation learning (Supervision is from experts' actions)，模仿学习 。在这一过程中，agent 看不到奖励，而是简单地模仿 expets' actions 。Behavior Learning is simply classification or regression.

训练流程：

Observe this state \(s_t\).
Policy network makes a prediction:

\(\mathbf{p}_t = [π(1 ∣ s_t, 𝛉), ⋯, π(361 ∣ s_t, 𝛉)] ∈ (0,1)^{361}\).

The expert’s action is \(a_t^\star = 281\).
Let \(\mathbf{y}_t \in \{0,1\}^{361}\) be the one-hot encode of \(a_t^\star = 281\).
\(\mathrm{Loss} = \mathrm{CrossEntropy}(\mathbf{y}_t, \mathbf{p}_t)\).
Use gradient descent to update policy network.

After behavior cloning:

Suppose the current state \(s_t\) has appeared in training data.
The policy network imitates expert’s action \(a_t\). (Which is a good action!)

但是，如果 state \(s_t\) 没有出现在训练的数据集里，推理的效果（发起的 action 的质量）就会变差，而且错误会不断累加，使得发起的动作越来越奇怪，最终使 AI 输掉棋局。围棋中可能的 state 数量极其庞大，当前 state 未出现在数据集的概率很高，导致 behavior cloning 之后的策略网络质量并不高。因此，在 behavior cloning 之后，还要进行 RL 。

Behavior + RL beats behavior cloning with 80% chance.

RL

\[ \underset{ \begin{matrix} \mathrm{policy network with} \\ \mathrm{lastest param} \end{matrix} }{ \begin{matrix} \mathrm{Player} \\ (\mathrm{Agent}) \end{matrix} } \space\space\space\mathrm{V.S.}\space\space\space\space \underset{ \begin{matrix} \mathrm{policy network with} \\ \mathrm{old param} \end{matrix} }{ \begin{matrix} \mathrm{Opponent} \\ (\mathrm{Environment}) \end{matrix} } \]

Player’s parameters are the latest parameters of the policy network.
Opponent’s parameters are randomly selected from previous iterations.

RL 与 behavior cloning 最大的不同就是前者有 reward ，reward 定义方式如下：

Suppose a game ends at step \(T\).
Rewards:
\(r_1 = r_2 = r_3 = \cdots = r_{T-1} = 0\). (When the game has not ended.)
\(r_T = +1\) (winner).
\(r_T = -1\) (loser).
Recall that return is defined by \(u_t = ∑\limits_{i=t}^{T} r_i\). (No discount here.)
Winner’s returns: \(u_1 = u_2 = \cdots = u_T = +1\).
Loser’s returns: \(u_1 = u_2 = \cdots = u_T = -1\).

在下棋过程中，无法确定哪一步是「好棋」，哪一步是「臭棋」，所以只能采用「一刀切」的方式：若 AI 赢了这局，则这局的所有步骤都是「好棋」；若 AI 输了这局，则这局的所有步骤都是「臭棋」。

训练流程：

Two policy networks play a game to the end. (Player v.s. Opponent.)
Get a trajectory: \(s_1, a_1, s_2, a_2, ⋯, s_T, a_T\).
After the game ends, update the player’s policy network.
The player’s returns: \(u_1 = u_2 = ⋯ = u_T\). (Either \(+1\) or \(-1\).)
Sum of approximate policy gradients: \(𝐠_θ = ∑\limits_{t=1}^{T} \frac{𝜕 \log π(a_t ∣ s_t, 𝛉)}{𝜕 𝛉} \cdot u_t\).
Update policy network: \(𝛉 ← 𝛉 + β ⋅ 𝐠_θ\).

此后，策略网络 \(π\) 训练完毕。当要下棋的时候：

Observing the current state \(s_t\), randomly sample action

\[ a_t ∼ π(⋅ ∣ s_t, 𝛉) \]

但是：

The learned policy network \(π\) is strong, but not strong enough.
A small mistake may change the game result.

蒙特卡洛树搜索（ Monte Carlo Tree Search, MCTS ）（也会用到这个策略网络）比单纯使用这个策略网络的效果还要好。为了做 MCTS ，还需要一个 Value Network 。

Value Network

这里的 Value Network 不是对 \(Q_π\) 而是对 State-value function \(V_π\) 的近似。

和旧版本的 Alpha Go 不同，Alpha Go Zero 的策略网络和价值网络共用了一些卷积层：

\[ \underset{\mathrm{state}}{19 × 19 × 17~\mathrm{tensor}} \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} → \left\{\begin{matrix} \overset{ \begin{matrix}\mathrm{Policy} \\ \mathrm{Head} \end{matrix} }{→} & { \left\{\begin{matrix} π(1 ∣ s, 𝛉) \\ ⋯ \\ π(361 ∣ s, 𝛉) \end{matrix}\right. } \\ \overset{ \begin{matrix}\mathrm{Value} \\ \mathrm{Head} \end{matrix} }{→} & \mathrm{\underset{(scalar)}{State~Value}} \end{matrix}\right. \]

价值网络 \(V\) 是靠策略网络 \(π\) 的帮助训练的，所以要先训练策略网络 \(π\) 后训练价值网络 \(V\) ，因而这并不算是 action-critic method（两者同时训练）。

价值网络的训练流程：

Play a game to the end.（通过 Player 和 Opponent 两个网络互相博弈。）

If win, let \(u_1 = u_2 = ⋯ = u_T = +1\).
If lose, let \(u_1 = u_2 = ⋯ = u_T = -1\).

Loss: \(L = ∑\limits_{t=1}^{T} \frac{1}{2} \left[ v(s_t; 𝐰) - u_t \right]^2\).
Update: \(𝐰 ← 𝐰 - α ⋅ \frac{𝜕 L}{𝜕 𝐰}\).

Alpha Go 的训练就此结束。

通过 MCTS 进行决策

人类在下围棋的时候都会思考未来的状态，也就是「向前看」。同理，AI 下棋的时候也要「向前看」。AI 就是通过 MCTS 「向前看」的，MCTS 的主要思想如下：

随机选取一个 action \(a\) ；

但是暴力搜索未来的每一种情形显然是不可能的，因此需要通过策略网络丢弃「不好」的 action 。

Look ahead and see whether \(a\) leads to win or lose;
Repeat this procedure many times;
Choose the action \(a\) that has the highest score.

Every simulation of Monte Carlo Tree Search (MCTS) has 4 steps:

Selection: The player makes an action \(a\). (Imaginary action; not actual move.)
Expansion: The opponent makes an action; the state updates. (Also imaginary action; made by the policy network.)
Evaluation: Evaluate the state-value and get score \(v\). Play the game to the end to receive reward \(r\). Assign score \(\frac{v + r}{2}\) to action \(a\).
Backup: Use the score \(\frac{v + r}{2}\) to update action-values.

Selection

已观测到 state \(s_t\) ，下一步该发起什么 action ？

First, for all the valid actions \(a\), calculate the score:

\[ \mathrm{score}(a) = Q(a) + η ⋅ \frac{\pi(a ∣ s_t, 𝛉)}{1 + N(a)} \]

\(Q(a)\): Action-value computed by MCTS. (To be defined.)
\(π(a ∣ s_t, 𝛉)\): The learned policy network.
\(N(a)\): Given \(s_t\), how many times we have selected \(a\) so far.

Second, the action with the biggest \(\mathrm{score}(a)\) is selected.

比如：

\[ s_t → \left\{\begin{matrix} \mathrm{score}(a_1) = 0.8 \\ \mathrm{score}(a_2) = 0.3 \\ \mathrm{score}(a_3) = 0.6 \end{matrix}\right. \]

0.8 最高，因此选择 action \(a_1\) 。

Expansion

在发起选中的 action 以后，opponent 的 action 将是什么？

Given \(a_t\), the opponent’s action \(a'_t\) will lead to the new state \(s_{t+1}\).
The opponent’s action is randomly sampled from \(a'_t ∼ π(⋅ ∣ s'_t; 𝛉)\).

Here, \(s'_t\) is the state observed by the opponent.

- The state-transition probability \(p(s_{t+1} ∣ s_t, a_t)\) is unknown.
- Use the policy function \(\pi\) as the state-transition function \(p\).

比如：

\[ s_t → \left\{\begin{matrix} a_t & → & \left\{\begin{matrix} \overset{\mathrm{Prob} = 0.6}{→} & \mathrm{state~1} \\ \overset{\mathrm{Prob} = 0.4}{→} & \mathrm{state~2} \end{matrix}\right. \\ - \\ - \end{matrix}\right. \]

在产生 \(\mathrm{state~1}, \mathrm{state~2}\) 的两个 action 中随机抽样一个，state 变化为 \(\mathrm{state~2}\) 作为 \(s_{t + 1}\) ：

\[ s_t → \left\{\begin{matrix} a_t & → & \left\{\begin{matrix} - \\ s_{t + 1} \end{matrix}\right. \\ - \\ - \end{matrix}\right. \]

Evaluation

Run a rollout to the end of the game (step \(T\)).

Player’s action: \(a_k ∼ π(⋅ ∣ s_k; 𝛉)\).
Opponent’s action: \(a'_k ∼ π(⋅ ∣ s'_k; 𝛉)\).
Receive reward \(r_T\) at the end.
Win: \(r_T = +1\).
Lose: \(r_T = -1\).

\[ s_t → \left\{\begin{matrix} a_t & → & \left\{\begin{matrix} - \\ s_{t + 1} & \underset{\mathrm{Fast~Rollout}}{\underbrace{\overset{\mathrm{Player}}{→} a_{t + 1} \overset{\mathrm{Opponent}}{→} s_{t + 2} \overset{\mathrm{Player}}{→} ⋯ \overset{\mathrm{Opponent}}{→} s_T}} \end{matrix}\right. \\ - \\ - \end{matrix}\right. \]

除了用 reward 评价 state ，还要用价值网络来评价 \(s_{t + 1}\) ：

\(v(s_{t + 1}; 𝐰)\): output of the value network.

\[ s_t → \left\{\begin{matrix} a_t & → & \left\{\begin{matrix} - \\ s_{t + 1} & ⟹ \mathrm{Record}~V(s_{t+1}) \end{matrix}\right. \\ - \\ - \end{matrix}\right. \]

其中：

\[ V(s_{t+1}) = \frac{1}{2} v(s_{t+1}; 𝐰) + \frac{1}{2} r_T. \]

Backup

MCTS repeats such a simulation many times.
Each child of \(a_t\) has multiple recorded \(V(s_{t+1})\).
Update action-value: \(Q(a_t) = \mathrm{mean}(\text{the recorded } V\text{'s})\) (Initially \(Q(a) = 0\)).
The \(Q\) values will be used in Step 1 (selection).

\[ s_t → \left\{\begin{matrix} Q(a_1) \\ Q(a_2) \\ Q(a_3) \end{matrix}\right. \]

由式 \(\mathrm{score}(a) = Q(a) + η ⋅ \frac{\pi(a ∣ s_t, 𝛉)}{1 + N(a)}\) 可知，模拟最开始的时候，\(\mathrm{score}(a)\) 几乎完全由策略网络决定。随着模拟次数的增多，\(N(a)\) 不断增大，策略网络的权重降低，\(Q(a)\) 的权重增加。

决策

\(N(a)\): How many times \(a\) has been selected so far.
After MCTS, the player makes actual decision:

\[ a_t = \arg\max_a N(a). \]

总之，Alpha Go 通过 MCTS 进行决策的总流程如下：

MCTS has 4 steps: selection, expansion, evaluation, and backup.
To perform one action, AlphaGo repeats the 4 steps for many times to calculate \(Q(a)\) and \(N(a)\) (for every action \(a\)).
AlphaGo executes the action \(a\) with the highest \(N(a)\) value.
To perform the next action, AlphaGo does MCTS all over again.(Initialize \(Q(a)\) and \(N(a)\) to zeros.)

Alpha Go Zero V.S. Alpha Go

AlphaGo Zero is stronger than AlphaGo. (100–0 against AlphaGo.)
Differences:
AlphaGo Zero does not use human experience. (No behavior cloning.)
MCTS is used to train the policy network.

Alpha Go Zero 没有人类下棋的经验，但效果比 Alpha Go 更好，证明人类的下棋经验对 AI 下棋训练反而是有害的。

但并不是说 behavior cloning 完全没用。比如手术机器人，如果沿用和 Alpha Go Zero 一样的训练方案，就需要拿真实的患者开刀，这可能会导致患者死亡，使训练的代价过于高昂；而 behavior cloning 的话，AI 通过模拟仿真进行初步的训练，就可以避免这种代价。

Alpha Go Zero 训练策略网络的方式

Observe state \(s_t\).
Predictions made by policy network:

\[ 𝐩 = [π(a = 1 ∣ s_t; 𝛉), ⋯, π(a = 361 ∣ s_t; 𝛉)] \in ℝ^{361}. \]

Predictions made by MCTS:

\[ 𝐧 = \mathrm{normalize}\,[N(a = 1), N(a = 2), ⋯, N(a = 361)] \in ℝ^{361}. \]

Loss:

\[ L = \mathrm{CrossEntropy}(𝐧, 𝐩). \]

Use \(\frac{𝜕 L}{𝜕 𝛉}\) to update \(𝛉\).

作于 2026-4-5

[<<]Actor-Critic Method

Monte Carlo Algorithms[>>]

|三

ノート環状線

[ERROR]连接出错，请重试

______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

实例：Alpha Go 下围棋

Policy Network

state \(s\)

Behavior Cloning

RL

Value Network

通过 MCTS 进行决策

Selection

Expansion

Evaluation

Backup

决策

Alpha Go Zero V.S. Alpha Go

Alpha Go Zero 训练策略网络的方式

|三

ノート環状線

[ERROR]连接出错，请重试

______ ______________ __

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\__|____/_____|_/\_\

+dwb===================dwb+

实例：Alpha Go 下围棋

Policy Network

state \(s\)

Behavior Cloning

RL

Value Network

通过 MCTS 进行决策

Selection

Expansion

Evaluation

Backup

决策

Alpha Go Zero V.S. Alpha Go

Alpha Go Zero 训练策略网络的方式

______

|___|_|\|/___|_/\_\