______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

ノート環状線

Multi-Agent Reinforcement Learning

Setting

Multi-agent problems commonly fall into one of the following 4 relationship settings:

Fully cooperative: Agents collaborate to optimize a common return.
Fully competitive: One agent's gain is the other agent's loss.
Mixed cooperative & competitive: Some relationships between agents are cooperative and others are competitive.
Self-interested: Agents are self-interested. Their rewards may or may not conflict.

Terminology

State, Action, State Transition

There are \(n\) agents.
Let \(S\) be the state.
Let \(A^i\) be the \(i\)-th agent's action.
State transition:

\[ p(s' ∣ s, a^1, ⋯, a^n) = ℙ(S' = s' ∣ S = s, A^1 = a^1, ⋯, A^n = a^n). \]

The next state \(S'\) depends on all the agents' actions.
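
To make the joint dependence concrete, here is a minimal sketch of sampling \(S'\) from \(p(⋅ ∣ s, a^1, a^2)\). The two-agent toy problem, the state names, and every probability in `P` are made up for illustration and are not from the lecture.

```python
import random

# Hypothetical toy transition table p(s' | s, a1, a2) for n = 2 agents.
# Keys are (s, a1, a2); values map each next state s' to its probability.
P = {
    ("s0", 0, 0): {"s0": 1.0},
    ("s0", 0, 1): {"s0": 0.5, "s1": 0.5},
    ("s0", 1, 0): {"s0": 0.5, "s1": 0.5},
    ("s0", 1, 1): {"s1": 1.0},
}

def sample_next_state(s, joint_action, rng):
    """Draw S' ~ p(. | s, a^1, ..., a^n) from the table above."""
    dist = P[(s, *joint_action)]
    states = list(dist)
    weights = [dist[x] for x in states]
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_next_state("s0", (1, 1), rng))  # always "s1": both agents chose action 1
```

Neither agent alone decides where the environment goes: with the joint action `(0, 1)` or `(1, 0)` the next state is a coin flip, and only the pair `(1, 1)` reaches `s1` with certainty.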

Rewards

Let \(R^i\) be the reward received by the \(i\)-th agent.
Fully cooperative: \(R^1 = R^2 = ⋯ = R^n\)
Fully competitive: \(R^1 ∝ -R^2\)
\(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\{A^j\}_{j ≠ i}\).

Returns

Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\).
Return (of the \(i\)-th agent):

\[ U_t^i = R_t^i + R_{t+1}^i + R_{t+2}^i + R_{t+3}^i + ⋯ \]

Discounted return (of the \(i\)-th agent):

\[ U_t^i = R_t^i + γ R_{t+1}^i + γ^2 R_{t+2}^i + γ^3 R_{t+3}^i + ⋯ \]

Here, \(γ ∈ [0,1]\) is the discount rate.
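
As a small worked example of the discounted return, the sketch below evaluates \(U_t^i\) for two agents. It assumes a finite episode (the infinite sum is truncated), and the reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma):
    """U_t = sum_k gamma^k * R_{t+k} over a finite reward sequence."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u   # backward recursion: U_t = R_t + gamma * U_{t+1}
    return u

# Hypothetical reward sequences R_t^i, R_{t+1}^i, R_{t+2}^i for two agents.
rewards_per_agent = {1: [1.0, 0.0, 2.0], 2: [-1.0, 0.0, -2.0]}
gamma = 0.5
returns = {i: discounted_return(rs, gamma) for i, rs in rewards_per_agent.items()}
print(returns)  # {1: 1.5, 2: -1.5}
```

Each agent has its own return because each agent has its own reward sequence. Here the rewards are exact opposites, so the returns are too.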

Policy Network

Each agent has its own policy network: \(π(a^i ∣ s; 𝛉^i).\)
Policy networks can be exchangeable: \(𝛉^1 = 𝛉^2 = ⋯ = 𝛉^n\)
Self-driving cars can have the same policy.
Policy networks can be nonexchangeable: \(𝛉^i ≠ 𝛉^j.\)
Soccer players have different roles, e.g., striker, defender, goalkeeper.

Uncertainty in the Return

The reward \(R_t^i\) depends on \(S_t\) and \(A_t^1, A_t^2, ⋯, A_t^n\).
Uncertainty in \(S_t\) is from the state transition, \(p\).
Uncertainty in \(A_t^i\) is from the policy network: \(π(⋅ ∣ S_t; 𝛉^i)\)
The return: \(U_t^i = ∑\limits_{k=0}^{∞} γ^k ⋅ R_{t+k}^i\) depends on:
all the future states: \(\{S_t, S_{t+1}, S_{t+2}, ⋯\}\);
all the future actions: \(\{A_t^i, A_{t+1}^i, A_{t+2}^i, ⋯\}\), for all \(i = 1, ⋯, n\).

State-Value Function

State-value of the \(i\)-th agent:

\[ V^i(s_t; 𝛉^1, ⋯, 𝛉^n) = \mathbb{E}\big[U_t^i \mid S_t = s_t\big]. \]

The expectation is taken w.r.t. all the future actions and states except \(S_t\).
Randomness in actions: \(A_t^j \sim \pi(\cdot \mid s_t; 𝛉^j)\), for all \(j = 1, ⋯, n\).

(That is why the state-value \(V^i\) depends on \(𝛉^1, ⋯, 𝛉^n\).)

One agent’s state-value, \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\), depends on all the agents’ policies.
If any agent changes its policy, then all of \(V^1, ⋯, V^n\) can change.
Example: soccer game.
A striker improves his policy, while everyone else’s policies are fixed.
His teammates’ state-values all increase.
The opposing players’ state-values all decrease.
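
The same effect can be checked numerically. The sketch below assumes a hypothetical one-step, two-agent zero-sum game (the `REWARD` table and the policies are illustrative), so the state-value reduces to the expected immediate reward under all agents' policies.

```python
from itertools import product

# Hypothetical one-step, two-agent zero-sum game: REWARD[(a1, a2)] = (r1, r2).
REWARD = {
    (0, 0): (1.0, -1.0),
    (0, 1): (-1.0, 1.0),
    (1, 0): (-1.0, 1.0),
    (1, 1): (1.0, -1.0),
}

def state_values(policies):
    """Exact V^i for a one-step game: expectation of R^i over A^j ~ pi^j."""
    n = len(policies)
    values = [0.0] * n
    for joint in product(*(range(len(p)) for p in policies)):
        prob = 1.0
        for agent, action in enumerate(joint):
            prob *= policies[agent][action]
        for i in range(n):
            values[i] += prob * REWARD[joint][i]
    return values

pi2 = [0.75, 0.25]                        # agent 2's policy is held fixed
print(state_values([[0.5, 0.5], pi2]))    # [0.0, 0.0]
print(state_values([[1.0, 0.0], pi2]))    # [0.5, -0.5]
```

Only agent 1's policy changed between the two calls, yet both \(V^1\) and \(V^2\) moved.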

Convergence

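Convergence: the expected return can no longer be increased by improving the policy. If none of the agents can find a better policy, training has converged and can be stopped.

Recall single-agent policy learning: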

Policy network: \(\pi(a ∣ s; 𝛉)\).
State-value function: \(V(s; 𝛉)\).
\(J(𝛉) = 𝔼_S \left[ V(S; 𝛉) \right]\) evaluates how good the policy is.
Learn the policy network’s parameter, \(𝛉\), by \(\max\limits_{𝛉} \; J(𝛉)\).
Convergence: \(J(𝛉)\) stops increasing.

With multiple agents, the criterion for convergence is the Nash equilibrium:

While all the other agents’ policies remain the same, the \(i\)-th agent cannot get a better expected return by changing its own policy.
Every agent is playing a best-response to the other agents’ policies.
Nash equilibrium indicates convergence because no one has any incentive to deviate.
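
The best-response condition can be checked by brute force in a small game. The sketch below assumes deterministic policies and a hypothetical prisoner's-dilemma-style payoff table, neither of which comes from the lecture.

```python
from itertools import product

# Hypothetical two-agent game. PAYOFF[(a1, a2)] = (return of agent 1, return of agent 2).
# Action 0 = cooperate, action 1 = defect.
PAYOFF = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}
ACTIONS = (0, 1)

def is_nash(joint):
    """True if no single agent can gain by deviating while the others stay fixed."""
    for i in range(len(joint)):
        for alt in ACTIONS:
            deviated = list(joint)
            deviated[i] = alt
            if PAYOFF[tuple(deviated)][i] > PAYOFF[joint][i]:
                return False
    return True

print([joint for joint in product(ACTIONS, repeat=2) if is_nash(joint)])  # [(1, 1)]
```

Note that the equilibrium `(1, 1)` is not the outcome with the highest returns. Nash equilibrium only says that no single agent has an incentive to deviate.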

Difficulty of MARL

Having multiple agents makes training harder. Directly applying single-agent algorithms works poorly and may not converge, because:

The \(i\)-th agent’s policy network: \(\pi(a^i \mid s; 𝛉^i)\).
The \(i\)-th agent’s state-value function: \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\).
Objective function:

\[ J^i(𝛉^1, ⋯, 𝛉^n) = 𝔼_S \left[ V^i(S; 𝛉^1, ⋯, 𝛉^n) \right] \]

Learn the policy network’s parameter, \(𝛉^i\), by

\[ \max_{𝛉^i} \; J^i(𝛉^1, ⋯, 𝛉^n) \]

Note that the objectives \(\max_{𝛉^i} \; J^i(𝛉^1, ⋯, 𝛉^n)\) of different agents are not the same:

\[ \begin{matrix} \\ \text{The 1st agent solves:} & \max\limits_{𝛉^1} \; J^1(\underline{𝛉^1}, 𝛉^2, ⋯, 𝛉^n). \\ \text{The 2nd agent solves:} & \max\limits_{𝛉^2} \; J^2(𝛉^1, \underline{𝛉^2}, ⋯, 𝛉^n). \\ \vdots \\ \text{The }n\text{th agent solves:} & \max\limits_{𝛉^n} \; J^n(𝛉^1, 𝛉^2, ⋯, \underline{𝛉^n}). \end{matrix} \]

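Each agent has its own objective function. There is no common objective, and each agent updates only its own parameters.

This may prevent training from ever converging, because one agent's policy update changes the objective functions of all the other agents. For example: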

The \(i\)-th agent found \(𝛉_\star^i = \arg\max_{𝛉^i} \; J^i(𝛉^1, \cdots, 𝛉^n)\).
Now, another agent changes its policy.
So \(𝛉_\star^i\) is no longer the best policy of the \(i\)-th agent. The \(i\)-th agent has to find a new \(𝛉^i\).
The other agents’ objective functions will change, and therefore they will change their policies...

In this way all the agents keep changing their policies, and training may never converge.
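
A minimal sketch of this chasing behaviour, using matching pennies as an illustrative game (my choice, not from the lecture) and alternating deterministic best responses in place of gradient-based policy updates:

```python
# Matching pennies: agent 1 wants the actions to match, agent 2 wants them to differ.
def reward(i, a1, a2):
    match = 1 if a1 == a2 else -1
    return match if i == 1 else -match

def best_response(i, other_action):
    """Best deterministic action of agent i against the other agent's current action."""
    if i == 1:
        return max((0, 1), key=lambda a: reward(1, a, other_action))
    return max((0, 1), key=lambda a: reward(2, other_action, a))

a1, a2 = 0, 0
history = [(a1, a2)]
for step in range(8):
    if step % 2 == 0:
        a2 = best_response(2, a1)   # agent 2 re-optimizes against agent 1
    else:
        a1 = best_response(1, a2)   # which invalidates agent 1's choice, and so on
    history.append((a1, a2))
print(history)
```

The joint action cycles through `(0, 0)`, `(0, 1)`, `(1, 1)`, `(1, 0)` and back again: every time one agent reaches its own optimum, the other agent's optimum moves.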

Centralized and Decentralized

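Because one agent's policy update changes the objective functions of all the other agents, the agents need to be able to communicate with one another. There are two ways to communicate: centralized and decentralized.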

Fully decentralized: Every agent uses its own observations and rewards to learn its policy. Agents do not communicate.
Fully centralized: The agents send everything to the central controller. The controller makes decisions for all the agents.

The agents themselves only carry out actions and do not make decisions. Decisions are made by the central controller.

Centralized training with decentralized execution: A central controller is used during training. The controller is disabled after training.

Partial Observations

MARL usually assumes partial observations, because an agent can often observe only a local part of the state and cannot see the global state.

An agent may or may not have full knowledge of the state, \(s\).
Let \(o^i\) be the \(i\)-th agent’s observation.
Partial observation: \(o^i \ne s\).
Full observation: \(o^1 = ⋯ = o^n = s\).

Fully Decentralized

Training process (exactly the same as in single-agent RL):

\[ \begin{matrix} \begin{matrix} \\ \text{Agent } 1 & \text{Agent } 2 & ⋯ & \text{Agent } n \\ a^1 ↑↓ o^1, r^1 & a^2 ↑↓ o^2, r^2 & & a^n ↑↓ o^n, r^n \end{matrix} \\ \text{Environment} \end{matrix} \]

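After training, each agent uses its own policy network to make decisions. Neither during training nor during inference do the agents communicate with each other.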

\[ \begin{matrix} \begin{matrix} \\ a^1 ∼ π(⋅ ∣ o^1; 𝛉^1) & a^2 ∼ π(⋅ ∣ o^2; 𝛉^2) & & a^n ∼ π(⋅ ∣ o^n; 𝛉^n) \\ \text{Agent } 1 & \text{Agent } 2 & ⋯ & \text{Agent } n \\ a^1 ↑↓ o^1 & a^2 ↑↓ o^2 & & a^n ↑↓ o^n \end{matrix} \\ \text{Environment} \end{matrix} \]

The \(i\)-th agent has a policy network (actor): \(\pi(a^i ∣ o^i; 𝛉^i)\).
The \(i\)-th agent has a value network (critic): \(q(o^i, a^i; \mathbf{w}^i)\).
Agents do not share observations and actions.
Train the policy and value networks in the same way as the single-agent setting.
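This does not work well: each agent ignores the fact that the other agents keep changing their policies, so training may not converge.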

Fully Centralized

Centralized Training:

\[ \text{Central Controller} \left\{ \begin{matrix} \underset{a^1}{\overset{o^1, r^1}{⇆}} & \text{Agent} 1 ↔ \\ \underset{a^2}{\overset{o^2, r^2}{⇆}} & \text{Agent} 2 ↔ \\ ⋯ \\ \underset{a^n}{\overset{o^n, r^n}{⇆}} & \text{Agent} n ↔ \end{matrix} \right\} \text{Environment} \]

Centralized Execution:

\[ \overset{π(a ^ i ∣ o ^ 1, ⋯, o ^ n; 𝛉 ^ i) \text{ for all } i = 1, 2, ⋯, n} {\text{Central Controller}} \left\{ \begin{matrix} \underset{a^1}{\overset{o^1, r^1}{⇆}} & \text{Agent} 1 ↔ \\ \underset{a^2}{\overset{o^2, r^2}{⇆}} & \text{Agent} 2 ↔ \\ ⋯ \\ \underset{a^n}{\overset{o^n, r^n}{⇆}} & \text{Agent} n ↔ \end{matrix} \right\} \text{Environment} \]

For example, the centralized approach applied to the actor-critic method:

Let \(𝐚 = [a^1, a^2, ⋯, a^n]\) contain all the agents’ actions.
Let \(𝐨 = [o^1, o^2, ⋯, o^n]\) contain all the agents’ observations.
The central controller knows \(𝐚\), \(𝐨\), and all the rewards.
The controller has \(n\) policy networks and \(n\) value networks:
Policy network (actor) for the \(i\)-th agent: \(\pi(a^i ∣ 𝐨; 𝛉^i)\).
Value network (critic) for the \(i\)-th agent: \(q(𝐨, 𝐚; 𝐰^i)\).
Centralized Training: Training is performed by the controller.
The controller knows all the observations, actions, and rewards.
Train \(\pi(a^i ∣ 𝐨; 𝛉^i)\) using policy gradient.
Train \(q(𝐨, 𝐚; 𝐰^i)\) using TD algorithm.
Centralized Execution: Decisions are made by the controller.
For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.
The controller knows \(𝐨 = [o^1, o^2, ⋯, o^n]\).
For all \(i\), the controller samples action by \(a^i ∼ π(⋅ ∣ 𝐨; 𝛉^i)\) and sends \(a^i\) to the \(i\)-th agent.
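
The centralized execution loop above can be sketched as follows. The function `controller_step` and the toy policy `agree_policy` are hypothetical stand-ins for the controller and for the policy networks \(π(⋅ ∣ 𝐨; 𝛉^i)\).

```python
import random

def controller_step(observations, policies, rng):
    """Fully centralized execution: every policy sees the joint observation o."""
    joint_obs = tuple(observations)          # o = [o^1, ..., o^n]
    actions = []
    for policy in policies:                  # stand-in for pi(. | o; theta^i)
        probs = policy(joint_obs)
        actions.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return actions                           # a^i is then sent back to agent i

def agree_policy(joint_obs):
    # Hypothetical policy: take action 1 with certainty only if all agents report the same observation.
    return [0.0, 1.0] if len(set(joint_obs)) == 1 else [1.0, 0.0]

rng = random.Random(0)
print(controller_step(["a", "a"], [agree_policy, agree_policy], rng))  # [1, 1]
print(controller_step(["a", "b"], [agree_policy, agree_policy], rng))  # [0, 0]
```

Each agent's action depends on every agent's observation, which no agent could compute on its own. That is also why every step requires a round trip to the controller.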

The drawback of the fully centralized approach is slow execution:

All the agents send their observations to the central controller.
The central controller makes decisions, \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\), and sends \(a^i\) to the \(i\)-th agent.

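When Wang Shusen reached this point in the lecture, he said "the central authority personally deploys and personally commands...", and as a result the AI-generated Chinese subtitles for this episode are gone, which 環状線 could not help laughing at. The line occurs at 8 min 54 s in this episode's video.

Communication and synchronization cost time.
Real-time decision making is impossible.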

Centralized Training with Decentralized Execution

For example, for the actor-critic method:

Each agent has its own policy network (actor): \(π(a^i ∣ o^i; 𝛉^i)\).
The central controller has \(n\) value networks (critics): \(q(𝐨, 𝐚; 𝐰^i)\).
Centralized Training: During training, the central controller knows all the agents’ observations, actions, and rewards.
Decentralized Execution: During execution, the central controller and its value networks are not used.

Centralized Training:

\[ \left. \begin{matrix} \text{Critic }1 \\ \text{Critic }2 \\ ⋯ \\ \text{Critic }n \end{matrix} \right\} ← \overset{\{a ^ i, o ^ i, r ^ i\}^n_{i = 1}} {\text{Central Controller}} \left\{ \begin{matrix} \overset{a^1 ,o^1, r^1}{←} & \text{Agent} 1 & \underset{a^1}{\overset{o^1, r^1}{⇆}} \\ \overset{a^2 ,o^2, r^2}{←} & \text{Agent} 2 & \underset{a^2}{\overset{o^2, r^2}{⇆}} \\ & ⋯ \\ \overset{a^n ,o^n, r^n}{←} & \text{Agent} n & \underset{a^n}{\overset{o^n, r^n}{⇆}} \end{matrix} \right\} \text{Environment} \]

The central controller trains the critics, \(q(\mathbf{o}, \mathbf{a}; 𝐰^i)\), for all \(i\).
To update \(𝐰^i\), TD algorithm takes as inputs:
All the actions: \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\).
All the observations: \(\mathbf{o} = [o^1, o^2, \cdots, o^n]\).
The \(i\)-th reward: \(r^i\).

\[ \overset{\{a ^ i, o ^ i, r ^ i\}^n_{i = 1}} {\text{Central Controller}} → \left\{ \begin{matrix} \text{Critic }1 & \overset{q ^ 1 = q(𝐨, 𝐚; 𝐰 ^ 1)}{→} \text{Actor }1 \\ \text{Critic }2 & \overset{q ^ 2 = q(𝐨, 𝐚; 𝐰 ^ 2)}{→} \text{Actor }2 \\ ⋯ \\ \text{Critic }n & \overset{q ^ n = q(𝐨, 𝐚; 𝐰 ^ n)}{→} \text{Actor }n \end{matrix} \right. \]

Each agent locally trains the actor, \(\pi(a^i \mid o^i; 𝛉^i)\), using policy gradient.
To update \(𝛉^i\), the policy gradient algorithm takes as input \((a^i, o^i, q^i)\).
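
A minimal sketch of one such training step is given below. It makes several assumptions that are not in the lecture: tables stand in for both networks, the TD update is SARSA-style, the policy is a softmax over per-observation logits, and the hyperparameters and the transition are made up.

```python
import math
from collections import defaultdict

GAMMA, ALPHA, BETA = 0.9, 0.1, 0.1   # hypothetical hyperparameters
N_ACTIONS = 2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

class Critic:
    """Tabular stand-in for q(o, a; w^i); lives on the central controller."""
    def __init__(self):
        self.q = defaultdict(float)

    def td_update(self, o, a, r_i, o_next, a_next):
        target = r_i + GAMMA * self.q[(o_next, a_next)]
        self.q[(o, a)] += ALPHA * (target - self.q[(o, a)])
        return self.q[(o, a)]

class Actor:
    """Tabular softmax stand-in for pi(a^i | o^i; theta^i); lives on agent i."""
    def __init__(self):
        self.theta = defaultdict(lambda: [0.0] * N_ACTIONS)

    def probs(self, o_i):
        return softmax(self.theta[o_i])

    def pg_update(self, o_i, a_i, q_i):
        # Gradient of log pi(a_i | o_i) w.r.t. the logits is one_hot(a_i) - pi(. | o_i).
        p = self.probs(o_i)
        for b in range(N_ACTIONS):
            self.theta[o_i][b] += BETA * q_i * ((1.0 if b == a_i else 0.0) - p[b])

# One training step for agent 1 of n = 2, on a made-up transition.
critic, actor = Critic(), Actor()
o, a = ("x", "y"), (0, 1)            # joint observation and joint action
o_next, a_next = ("x", "x"), (1, 1)
q1 = critic.td_update(o, a, r_i=1.0, o_next=o_next, a_next=a_next)  # uses all of o and a
actor.pg_update(o_i=o[0], a_i=a[0], q_i=q1)                         # uses only (a^1, o^1, q^1)
print(q1, actor.probs("x"))
```

The critic is keyed by the joint observation and joint action, while the actor only ever sees its own observation. This is what lets the critic be discarded at execution time.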

After training, the central controller is no longer needed, and each agent interacts with the environment independently.

Decentralized Execution:

\[ \left. \begin{matrix} a ^ 1 ∼ π(⋅ ∣ o ^ 1; 𝛉 ^ 1) & \text{Agent} 1 & \underset{a^1}{\overset{o^1}{⇆}} \\ a ^ 2 ∼ π(⋅ ∣ o ^ 2; 𝛉 ^ 2) & \text{Agent} 2 & \underset{a^2}{\overset{o^2}{⇆}} \\ & ⋯ \\ a ^ n ∼ π(⋅ ∣ o ^ n; 𝛉 ^ n) & \text{Agent} n & \underset{a^n}{\overset{o^n}{⇆}} \end{matrix} \right\} \text{Environment} \]

Parameter Sharing

Policy networks: \(π(a^i \mid o^i; 𝛉^i)\), for \(i = 1, 2, ⋯, n\).
Value networks: \(q(𝐨, 𝐚; 𝐰^i)\), for \(i = 1, 2, ⋯, n\).
Trainable parameters: \(\{𝛉^i, 𝐰^i\}_{i=1}^n\).
Parameter sharing: \(𝛉^i = 𝛉^j\) and \(𝐰^i = 𝐰^j\), for some \(i\) and \(j\).

So, should parameters be shared? That depends on whether the agents are exchangeable. If they are exchangeable, the parameters can be shared. Otherwise they cannot.
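
In code, parameter sharing simply means that several agents hold a reference to the same parameter object. The sketch below is illustrative only: the `Agent` class and the two-element parameter vectors are hypothetical, and the car and soccer examples are the ones used earlier for exchangeable and nonexchangeable policies.

```python
class Agent:
    def __init__(self, theta):
        self.theta = theta   # policy parameters theta^i

shared = [0.0, 0.0]                       # one parameter vector used by every car
cars = [Agent(shared) for _ in range(3)]  # exchangeable agents: theta^1 = theta^2 = theta^3
cars[0].theta[0] += 1.0                   # an update made through any agent...
print(cars[2].theta)                      # ...is seen by all: [1.0, 0.0]

roles = ["striker", "defender", "goalkeeper"]
players = {role: Agent([0.0, 0.0]) for role in roles}  # nonexchangeable: separate theta^i
players["striker"].theta[0] += 1.0
print(players["goalkeeper"].theta)        # unchanged: [0.0, 0.0]
```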

Written on 2026-4-9
