Multi-Agent 强化学习
Setting
Multi-agent 有常见以下 4 种关系设定:
Fully cooperative: Agent collaborate to optimize a common return.
Fully competitive: One agent's gain is the other agent's loss.
Mixed cooperative & competitive: There are both cooperative setting and competitive setting.
Self-interested: Agents are self-interested. Their rewards may or may not conflict.
专业术语
State, Action, State Transition
There are \(n\) agents.
Let \(S\) be the state.
Let \(A^i\) be the \(i\)-th agent's action.
State transition:
The next state \(S'\) depends on all the agents' actions.
Rewards
Let \(R^i\) be the reward received by the \(i\)-th agent.
Fully cooperative: \(R^1 = R^2 = ⋯ = R^n\)
Fully competitive: \(R^1 ∝ -R^2\)
\(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\{A^j\}_{j ≠ i}\).
Returns
Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\).
Return (of the \(i\)-th agent):
Discounted return (of the \(i\)-th agent):
Here, \(γ ∈ [0,1]\) is the discount rate.
Policy Network
Each agent has its own policy network: \(π(a^i ∣ s; 𝛉^i).\)
Policy networks can be exchangeable: \(𝛉^1 = 𝛉^2 = ⋯ = 𝛉^n\)
Self-driving cars can have the same policy.
Policy networks can be nonexchangeable: \(𝛉^i ≠ 𝛉^j.\)
Soccer players have different roles, e.g., striker, defender, goalkeeper.
Uncertainty in the Return
The reward \(R_t^i\) depends on \(S_t\) and \(A_t^1, A_t^2, ⋯, A_t^n\).
Uncertainty in \(S_t\) is from the state transition, \(p\).
Uncertainty in \(A_t^i\) is from the policy network: \(π(⋅ ∣ S_t; 𝛉^i)\)
The return: \(U_t^i = ∑\limits_{k=0}^{∞} γ^k ⋅ R_{t+k}^i\) depends on:
all the future states: \(\{S_t, S_{t+1}, S_{t+2}, ⋯\}\);
all the future actions: \(\{A_t^i, A_{t+1}^i, A_{t+2}^i, ⋯\}\), for all \(i = 1, ⋯, n\).
State-Value Function
State-value of the \(i\)-th agent:
The expectation is taken w.r.t. all the future actions and states except \(S_t\).
Randomness in actions: \(A_t^j \sim \pi(\cdot \mid s_t; 𝛉^j)\), for all \(j = 1, ⋯, n\).
(That is why the state-value \(V^i\) depends on \(𝛉^1, ⋯, 𝛉^n\).)
One agent’s state-value, \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\), depends on all the agents’ policies.
If any agent changes its policy, then all of \(V^1, ⋯, V^n\) can change.
Example: soccer game.
A striker improves his policy, while everyone else’s policies are fixed.
His teammates’ state-values all increase.
The opposing players’ state-values all decrease.
Convergence
Convergence(收敛):无法通过改进策略来获得更大的期望回报。如果所有 agents 都找不到更好的策略,就说明已经收敛,可以中止训练了。
回顾一下 single-agent policy learning :
Policy network: \(\pi(a ∣ s; 𝛉)\).
State-value function: \(V(s; 𝛉)\).
\(J(𝛉) = 𝔼_S \left[ V(S; 𝛉) \right]\) evaluates how good the policy is.
Learn the policy network’s parameter, \(𝛉\), by \(\max\limits_{𝛉} \; J(𝛉)\).
Convergence: \(J(𝛉)\) stops increasing.
如果有多个 agents ,那么判断收敛的标准就是纳什均衡( Nash Equilibrium ):
While all the other agents’ policy remain the same, the \(i\)-th agent cannot get better expected return by changing its own policy.
Every agent is playing a best-response to the other agents’ policies.
Nash equilibrium indicates convergence because no one has any incentive to deviate.
Difficulty of MARL
多个 agents 会令训练变得更困难,直接套用 single-agent 的算法效果不好,可能不会收敛,因为:
The \(i\)-th agent’s policy network: \(\pi(a^i \mid s; 𝛉^i)\).
The \(i\)-th agent’s state-value function: \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\).
Objective function:
Learn the policy network’s parameter, \(𝛉^i\), by
注意到不同的 agent 的目标 \(\max_{𝛉^i} \; J^i(𝛉^1, ⋯, 𝛉^n)\) 并不相同:
这些 agents 各自有各自的目标函数,没有共同的目标,各自更新各自的参数。
这可能导致永远无法收敛。因为一个 agent 的更新策略会导致其他所有 agents 的目标函数都发生变化,例如:
The \(i\)-th agent found \(𝛉_\star^i = \arg\max_{𝛉^i} \; J^i(𝛉^1, \cdots, 𝛉^n)\).
Now, another agent changes its policy.
So \(𝛉_\star^i\) is no longer the best policy of the \(i\)-th agent. The \(i\)-th agent has to find a new \(𝛉^i\).
The other agents’ objective functions will change, and therefore they will change their policies...
就这样所有的 agents 都在不停地改变自己的策略,可能永远都无法收敛。
Centralized and Decentralized
由于一个 agent 的更新策略会导致其他所有 agents 的目标函数都发生变化,因而需要令 agents 之间能够进行通信。通信方式有两种:Centralized ,中心化,和 Decentralized ,去中心化。
Fully decentralized: Every agent uses its own observations and rewards to learn its policy. Agents do not communicate.
Fully centralized: The agents send everything to the central controller. The controller makes decisions for all the agents.
Agent 本身只负责发起动作,不负责决策;决策由 central controller 负责。
Centralized training with decentralized execution: A central controller is used during training. The controller is disabled after training.
Partial Observations
MARL 通常假设 Partial Observations ,部分观测,这是因为 agent 往往只能观测到局部的 state 而看不到全局的 state 。
An agent may or may not have full knowledge of the state, \(s\).
Let \(o^i\) be the \(i\)-th agent’s observation.
Partial observation: \(o^i \ne s\).
Full observation: \(o^1 = ⋯ = o^n = s\).
Fully Decentralized
训练过程:(和 single-agent RL 的训练过程完全一样)
训练结束之后,每个 agent 使用各自的策略网络来做决策。无论是训练还是推理,agent 之间
The \(i\)-th agent has a policy network (actor): \(\pi(a^i ∣ o^i; 𝛉^i)\).
The \(i\)-th agent has a value network (critic): \(q(o^i, a^i; \mathbf{w}^i)\).
Agents do not share observations and actions.
Train the policy and value networks in the same way as the single-agent setting.
This does not work well.
Fully Centralized
Centralized Training:
Centralized Execution:
比如 Centralized 应用于 actor-critic method :
Let \(𝐚 = [a^1, a^2, ⋯, a^n]\) contain all the agents’ actions.
Let \(𝐨 = [o^1, o^2, ⋯, o^n]\) contain all the agents’ observations.
The central controller knows \(𝐚\), \(𝐨\), and all the rewards.
The controller has \(n\) policy networks and \(n\) value networks:
Policy network (actor) for the \(i\)-th agent: \(\pi(a^i ∣ 𝐨; 𝛉^i)\).
Value network (critic) for the \(i\)-th agent: \(q(𝐨, 𝐚; 𝐰^i)\).
Centralized Training: Training is performed by the controller.
The controller knows all the observations, actions, and rewards.
Train \(\pi(a^i ∣ 𝐨; 𝛉^i)\) using policy gradient.
Train \(q(𝐨, 𝐚; 𝐰^i)\) using TD algorithm.
Centralized Execution: Decisions are made by the controller.
For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.
The controller knows \(𝐨 = [o^1, o^2, ⋯, o^n]\).
For all \(i\), the controller samples action by \(a^i ∼ π(⋅ ∣ 𝐨; 𝛉^i)\) and sends \(a^i\) to the \(i\)-th agent.
Fully Centralized 的缺点是执行速度慢:
All the agents send their observations to the central controller.
The central controller makes decisions, \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\), and sends \(a^i\) to the \(i\)-th agent.
王树森讲到这里的时候说了一句「中央亲自部署,亲自指挥……」,因此这一集的 AI 中文字幕没了,实在令環状線忍俊不禁。该句出现在这一集视频的 8 分 54 秒。
Communication and synchronization cost time.
Real-time decision is impossible.
Centralized Training with Decentralized Execution
比如对于 actor-critic method :
Each agent has its own policy network (actor): \(π(a^i ∣ o^i; 𝛉^i)\).
The central controller has \(n\) value networks (critics): \(q(𝐨, 𝐚; 𝐰^i)\).
Centralized Training: During training, the central controller knows all the agents’ observations, actions, and rewards.
Decentralized Execution: During execution, the central controller and its value networks are not used.
Centralized Training:
The central controller trains the critics, \(q(\mathbf{o}, \mathbf{a}; 𝐰^i)\), for all \(i\).
To update \(𝐰^i\), TD algorithm takes as inputs:
All the actions: \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\).
All the observations: \(\mathbf{o} = [o^1, o^2, \cdots, o^n]\).
The \(i\)-th reward: \(r^i\).
Each agent locally trains the actor, \(\pi(a^i \mid o^i; 𝛉^i)\), using policy gradient.
To update \(𝛉^i\), the policy gradient algorithm takes as input \((a^i, o^i, q^i)\).
训练之后就不再需要中央控制器了,每个 agent 独立跟环境交互。
Decentralized Execution:
Parameter Sharing
Policy networks: \(π(a^i \mid o^i; 𝛉^i)\), for \(i = 1, 2, ⋯, n\).
Value networks: \(q(𝐨, 𝐚; 𝐰^i)\), for \(i = 1, 2, ⋯, n\).
Trainable parameters: \(\{𝛉^i, 𝐰^i\}_{i=1}^n\).
Parameter sharing: \(𝛉^i = 𝛉^j\) and \(𝐰^i = 𝐰^j\), for some \(i\) and \(j\).
那么,是否应该共享参数呢?这取决于 agents 之间是否 exchangeable 。如果 exchangeable ,则可以共享参数;反之则不能。
作于 2026-4-9